

STANFORD  UNIV. ,  CA 


DISTRIBUTION STATEMENT A

Approved for public release;
Distribution Unlimited


U.S. Department of Commerce

National  Technical  Information  Service 




Report  No.  STAN-CS-91-1361 

thesis 


PB96 - 148770 


Convergence  Bounds  for  Markov  Chains  and 
Applications  to  Sampling 

by 

Anil  Ramesh  Gangolli 


Department  of  Computer  Science 

Stanford  University 
Stanford,  California  94305 




Convergence Bounds for Markov Chains and Applications
to Sampling


AUTHOR(S)

Anil  Ramesh  Gangolli 


PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

Computer  Science  Department, 

Stanford  University 




9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

DARPA  /  ONR 
1400  Wilson  Blvd. 

Arlington  VA  22209 


11. SUPPLEMENTARY NOTES


12a. DISTRIBUTION/AVAILABILITY STATEMENT

UNLIMITED

12b. DISTRIBUTION CODE


13. ABSTRACT (Maximum 200 words)


Consider a discrete-time ergodic Markov chain on a finite state space $S$ with stationary distribution $\pi$.
By simulating such a chain, it is possible to draw random samples from $S$ that have distribution $\pi$
or nearly $\pi$. This thesis treats some basic questions that arise when one wants to apply such a
sampling method in a rigorous way.

We begin by reviewing recently developed techniques for proving convergence bounds for Markov
chains, and give some new convergence bounds for a number of chains related to “urn models.” We
then exhibit tight spectral bounds on the variance of natural mean-value estimators computed from
a time-reversible Markov chain, and we use these bounds to study issues of efficiency when computing
mean-value estimates by this method. Combining the variance bounds with a construction
of expander graphs, we obtain an efficient pseudo-random generator for mean-value estimation.
Finally, we present some experimental results obtained using the Markov chain sampling method on
a statistical problem.


14. SUBJECT TERMS

Analysis of Algorithms, Combinatorial Mathematics, Statistics

15. NUMBER OF PAGES
153


CONVERGENCE  BOUNDS  FOR  MARKOV  CHAINS  AND 
APPLICATIONS  TO  SAMPLING 


A  DISSERTATION 

SUBMITTED  TO  THE  DEPARTMENT  OF  COMPUTER  SCIENCE 

AND  THE  COMMITTEE  ON  GRADUATE  STUDIES 

OF  STANFORD  UNIVERSITY 

IN  PARTIAL  FULFILLMENT  OF  THE  REQUIREMENTS 

FOR  THE  DEGREE  OF 
DOCTOR  OF  PHILOSOPHY 


By 

Anil  Ramesh  Gangolli 
May  1991 


I  certify  that  I  have  read  this  dissertation  and  that  in  my  opinion 
it is fully adequate, in scope and in quality, as a dissertation for
the  degree  of  Doctor  of  Philosophy. 


Persi  Diaconis 
(Principal  Advisor) 


I  certify  that  I  have  read  this  dissertation  and  that  in  my  opinion 
it  is  fully  adequate,  in  scope  and  in  quality,  as  a  dissertation  for 
the  degree  of  Doctor  of  Philosophy. 


Donald  E.  Knuth 


I  certify  that  I  have  read  this  dissertation  and  that  in  my  opinion 
it is fully adequate, in scope and in quality, as a dissertation for
the  degree  of  Doctor  of  Philosophy. 


Rajeev  Motwani 


Approved  for  the  University  Committee  on  Graduate  Studies: 


Dean  of  Graduate  Studies 




Abstract 


Consider a discrete-time ergodic Markov chain on a finite state space $S$ with stationary distribution $\pi$.
By simulating such a chain, it is possible to draw random samples from $S$ that have distribution $\pi$
or nearly $\pi$. This thesis treats some basic questions that arise when one wants to apply such a
sampling method in a rigorous way.

We begin by reviewing recently developed techniques for proving convergence bounds for Markov
chains, and give some new convergence bounds for a number of chains related to “urn models.” We
then exhibit tight spectral bounds on the variance of natural mean-value estimators computed from
a time-reversible Markov chain, and we use these bounds to study issues of efficiency when computing
mean-value estimates by this method. Combining the variance bounds with a construction
of expander graphs, we obtain an efficient pseudo-random generator for mean-value estimation.
Finally, we present some experimental results obtained using the Markov chain sampling method on
a statistical problem.


Acknowledgements 

When I’m not thank’d at all, I’m thank’d enough.

I've  done  my  duty,  and  I’ve  done  no  more. 

—Henry  Fielding  (1707-1754) 

How sharper than a serpent’s tooth it is

To  have  a  thankless  child! 

-William Shakespeare (1564-1616), King Lear

In  1985,  Persi  Diaconis  gave  a  truly  inspiring  series  of  lectures  that  introduced  to  me  many  of 
the  ideas  that  underlie  this  work.  He  has  since  been  my  faithful  advisor,  friend,  hero,  and  academic 
father. I cannot thank him enough for his patient guidance.

Don Knuth is the reason I flunked my comprehensive exams on the initial attempt. He led a
programming and problem solving seminar that I attended during my first year here. It was so
fun, so fascinating, and so engrossing that I spent all my spare time thinking about those problems
and neglected to study any of my weak “comp” areas. (With more courses like that, we wouldn’t
need  exams!)  Over  the  period  in  which  I  did  this  work,  Don  not  only  provided  me  with  continuing 
inspiration  and  little  bits  of  great  advice,  he  arranged  my  research  funding  and  diligently  kept  the 
‘Black  Friday’  committee  at  bay.  His  careful  proofreading  of  this  text  has  prevented  several  errors 
from  reaching  publication. 

Rajeev  Motwani  graciously  agreed,  when  asked  somewhat  late  in  the  game,  to  be  the  third  reader 
on  my  thesis.  I  would  particularly  like  to  thank  him  for  his  suggestions  for  improving  Chapter  4. 

David  Aldous,  Andrei  Broder,  Pang  Chen,  Jim  Fill,  Arif  Merchant,  and  Alistair  Sinclair  have  all 
shared  their  ideas  on  Markov  chain  methods  with  me.  I  really  enjoyed  the  company  and  support  of 
this  hard-working  community,  especially  during  our  summer  seminars  at  Stanford.  I  would  also  like 
to thank Tomas Feder, Seffi Naor, and Yossi Azar for discussing various ideas with me.

The  National  Science  Foundation  (grant  numbers  DCR-83-51757  and  CCR-8610181-A2)  and 
the  Office  of  Naval  Research  (grant  numbers  N00014-87-K-0502  and  N00014-88-K-0166)  sustained 
various parts of this research. Some of the computing facilities used in this work were donated to
the Computer Science Department by AT&T, by Digital Equipment Corporation, and by IBM.

Because  this  marks  the  culmination  (I  think)  of  my  formal  education,  it  is  also  appropriate  here 
to  thank  the  numerous  other  teachers  who  have  guided  me  along  the  way.  I’d  like  particularly  to 
thank  Jim  Erikson,  E.  Lee  Stout,  Tom  Cover,  and  Bob  Floyd.  Ron  Graham  and  Oren  Patashnik 
deserve  special  mention  for  exciting  me  about  discrete  mathematics  with  their  brilliant  performance 
of  CS151  Concrete  Mathematics  in  1981,  my  first  undergraduate  year  here. 

Since  1991  -  1984  =  7,  it  takes  little  imagination  to  guess  that  I  was  not  working  on  my  thesis 
research  all  of  the  time.  Critical  at  the  other  times  were  my  good  friends  Thane  Plambeck,  C.  Greg 
Plaxton,  Yossi  Friedman,  Ashok  Subramanian,  John  Woodfill,  Andrew  Kosoresow,  Evan  Cohn, 
Ramsey  Haddad,  Sherry  Listgarten,  Gidi  Avrahami,  R.  Michael  Young,  and  the  rest  of  the  su.roger- 
or-andy  gang.  Gee,  what  a  smart  bunch!  I  would  especially  like  to  thank  two  very  special  friends, 
Elisabeth Jaffe and Carrie Shook, who gave me immeasurable emotional support and encouragement
during  these  years. 

Finally,  let  me  record  a  necessarily  inadequate  “thank  you”  to  my  parents,  Ramesh  and  Shanta 
Gangolli,  and  to  the  rest  of  my  family  for  all  of  their  support  and  faith. 

Anil  Gangolli 

May  1991  at  Stanford,  California. 


Introduction 


Everyone  understands,  intuitively,  that  when  we  shuffle  a  deck  of  cards,  we  do  so  in  order  to  get 
a  random  ordering  of  the  cards.  We  are  using  a  (presumably)  stochastic  process  whose  states  are 
orderings  of  the  cards  to  get  to  an  ordering  that  is  random  and  does  not  depend  on  the  original 
order  of  the  cards. 

At least as early as 1953, Metropolis et al. [MRR+53] suggested the following similar computational
technique for sampling from a set. The distribution of an ergodic Markov chain on its state
space converges to a unique stationary distribution $\pi$ that does not depend on the initial distribution.
So to sample according to $\pi$, simulate a Markov chain with stationary distribution $\pi$ until it
is  near  stationarity,  and  then  draw  samples  from  the  chain. 
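
As a rough illustration of the method just described, the following Python sketch simulates a Metropolis-type chain on a small cycle of states and then draws samples from it. The target weights, the cycle-shaped proposal, the function names, the burn-in length, and the sample spacing are all assumptions chosen for this example and are not taken from [MRR+53]; in particular, the burn-in used below is an arbitrary guess, not a proven bound on the time to reach stationarity.

```python
import random


def metropolis_step(x, weights, rng):
    """One step of a Metropolis chain on the cycle 0..n-1.

    The chain targets the distribution proportional to `weights`
    (hypothetical example data, all entries assumed positive).
    """
    n = len(weights)
    y = (x + rng.choice((-1, 1))) % n  # symmetric proposal: a neighbor on the cycle
    if rng.random() < min(1.0, weights[y] / weights[x]):
        return y  # accept the proposed move
    return x  # reject the move and stay in place


def sample_from_chain(weights, burn_in, num_samples, spacing, seed=0):
    """Run the chain for `burn_in` steps, then record a state every `spacing` steps."""
    rng = random.Random(seed)
    x = 0  # arbitrary initial state
    for _ in range(burn_in):
        x = metropolis_step(x, weights, rng)
    samples = []
    for _ in range(num_samples):
        for _ in range(spacing):
            x = metropolis_step(x, weights, rng)
        samples.append(x)
    return samples


if __name__ == "__main__":
    weights = [1, 2, 3, 2, 1]
    samples = sample_from_chain(weights, burn_in=200, num_samples=20000, spacing=2)
    total = sum(weights)
    for v, w in enumerate(weights):
        print(v, samples.count(v) / len(samples), w / total)
```

The printed empirical frequencies should be close to the normalised weights; how close, and how large the burn-in and spacing need to be, are exactly the questions raised in the next paragraph.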

To use this method in a rigorous way, one must answer questions like: ‘How long does it take for
the distribution of the chain to reach (or be near) the stationary distribution $\pi$?’ ‘How do samples
drawn from the chain perform (compared to independent $\pi$-distributed samples)?’

The  classical  theory  of  Markov  processes  offers  little  in  the  way  of  non-asymptotic  answers  to 
these questions. The role of the spectrum of the transition matrix in answering such questions has
been  known  for  some  time,  but  until  recently  this  relation  has  not  been  very  useful  because  of  the 
lack  of  good  bounds  on  the  spectrum.  It  is  only  in  the  last  ten  years  that  people  have  developed 
useful  techniques  to  answer  these  questions  for  the  large  sparse  chains  that  one  faces  in  practice. 
This  thesis  presents  some  contributions  to  this  field. 

In  the  first  chapter,  we  present  an  overview  of  some  new  techniques  for  bounding  convergence 
rates  of  Markov  chains,  and  in  particular  for  the  class  of  time-reversible  Markov  chains.  This  includes 
the  class  of  random  walks  on  undirected  graphs,  which  have  numerous  applications  in  theoretical 
computer  science.  We  devote  particular  attention  to  spectral  bounds,  based  on  recent  geometric 
arguments  to  bound  the  second  largest  eigenvalue.  These  bounds  relate  connectivity  and  ‘expansion’ 
properties  of  the  underlying  graph  to  bounds  on  the  eigenvalues,  and  therefore  to  bounds  on  the 
convergence  rate. 

In the second chapter we use these techniques to prove bounds for two classical ‘urn models,’
and some direct generalisations. These lead us naturally to consider random walks induced by
group actions. There has been a good deal of success recently in using harmonic analysis to obtain
convergence bounds for such processes. In order to be tractable, that type of analysis requires special
properties  beyond  that  of  the  group  or  group  action.  At  the  cost  of  obtaining  poorer  bounds,  we  take 
a  different  approach  that  does  not  require  special  additional  structure.  We  obtain  new  diameter- 
based  bounds  on  the  expansion  of  the  Cayley  graph  of  a  group  action.  This  yields  diameter  and 
volume  based  bounds  on  the  convergence  rate  of  certain  random  walks  based  on  groups. 

Estimating  the  mean  value  of  a  function  on  a  set  is  arguably  the  most  common  application  of 
sampling. In Chapter 3 we investigate the natural mean-value estimators on the Markov chain: sample
means based on samples drawn some t steps apart from the stationary (or near-stationary) chain.
We prove tight worst-case bounds on the variance of such sample means. It is not hard to see that
if t is large, so that we draw our samples as far apart as the time required to reach stationarity from
any initial position, then the samples are essentially independent, and have properties approximating
independent samples. Drawing samples so far apart, however, seems to ‘waste the information’
in the intervening states, which we ‘pay for’ by simulating numerous steps of the chain. However,
samples  drawn  at  frequent  intervals  will  be  correlated,  and  will  tend  to  increase  the  variance  of 
the  estimates;  which  method  will  be  better?  The  results  show  something  surprising.  Independent 
samples  give  no  smaller  variance  than  a  comparable  number  of  samples  drawn  at  a  spacing  t  that  is 
typically  much  smaller  than  the  time  required  to  reach  stationarity.  The  two  subsequent  chapters 
are  applications  of  this  idea. 

In Chapter 4, the results on estimation are applied together with a construction of a family of
expander graphs to show that, from a very small number of random bits, one can generate a large
set of random binary words that are essentially as good as true independent uniform random words
for estimating mean values of real-valued functions.

In  our  fifth  chapter,  we  consider  a  problem  that  arises  in  the  analysis  of  multivariate  statistical 
data:  estimating  the  significance  of  two-way  ‘contingency  tables’  under  the  uniform  distribution. 
Contingency  tables  are  arrays  with  nonnegative  integer  entries  whose  row  and  column  sums  are 
prescribed values. These tables arise in a surprising variety of settings in the theory of the symmetric
group and in statistics. They have drawn considerable attention from combinatorialists and
statisticians.  It  is  unknown  how  to  count  the  exact  number  of  tables  with  given  row  and  column 
sums,  which  suggests  that  one  will  not  easily  find  a  traditional  method  of  sampling  from  the  set. 

We  suggest  a  random  walk  that  provably  converges  to  the  uniform  distribution  on  the  set.  This 
yields  algorithms  to  estimate  significance  values  and  to  approximately  count  the  set.  The  algorithms 
will  give  good  performance  in  polynomial  time  provided  that  the  eigenvalues  of  the  chain  can  be 
suitably  bounded. 

We are able to prove polynomial-time bounds for a slight modification of our suggested algorithm
on a certain well-behaved class of instances. But these bounds do not hold for all instances, and are
too weak to imply practical running times.

We conjecture that better bounds actually hold, based on computational experience, and also
provide some theoretical motivation for these conjectures. The accuracy of various significance
estimates obtained using the walk compares well to values obtained by exact computations for some
cases, and also to results obtained by other means of approximation. These experiments demonstrate
the practicality of the suggested techniques and show that the walk seems, in practice, to display
the conjectured convergence properties.




Notation  and  Conventions 


We use the following notations and conventions frequently. Note, in particular, the definition of
graphs that we use.

Notation 

$R$ denotes the real numbers.

$Z$ denotes the integers, $Z_m$ the integers modulo $m$, and $N$ denotes the nonnegative integers.

$|x|$ denotes cardinality if $x$ is a set, absolute value if $x$ is a number, length if $x$ is a string.

$[n]$ denotes the set of positive integers $\{1, 2, 3, \ldots, n\}$.

$\ln n$ denotes the natural (base $e$) logarithm, while $\lg n$ denotes the logarithm base 2.

$U$ usually denotes the uniform distribution on the set under discussion, usually the state space $V$
of a Markov chain: $U(v) = 1/|V|$ for each $v \in V$.

$P(A)$, when $P$ is a probability distribution on $S$ and $A \subseteq S$, denotes the total probability of the
subset $A$: $P(A) = \sum_{a \in A} P(a)$.

$\|P - Q\|$ is the total variation between the distributions $P$ and $Q$. Total variation is a metric on
the space of probability distributions on a set. See Appendix A.

$\mathrm{rpd}(P, Q)$ denotes the relative pointwise distance of $P$ from $Q$. See Appendix A.

$\mathrm{sep}(P, Q)$ denotes the separation distance of $P$ from $Q$, another measure of distance between
distributions. See Appendix A.

$P^{(k)}_{ij}$ denotes the $(i, j)$ entry of the matrix $P^k$. That is, first raise $P$ to the power $k$, then take the
$(i, j)$ entry. The notation $P^k_{ij}$ without the parentheses means $(P_{ij})^k$, the $(i, j)$ entry raised alone
to the $k$th power.

$\pi_k$ the $k$th-step distribution of the Markov chain under discussion, given by $\pi_k = \pi_0 P^k$.




$L(P)$ the Laplacian operator associated to the chain $P$, namely $L = I - P$. We simply write $L$ when
$P$ is understood from context.

$Q(G)$ for a graph $G$ (see below) is the matrix $Q = D - A$, where $D$ is the diagonal matrix of
degrees $D_{vv} = \deg v$, $D_{vw} = 0$ for $v \ne w$, and $A$ is the adjacency matrix of $G$. This is the
graphical Laplacian that is used by Alon [Alo86] and others in the purely graph-theoretic
setting. We will simply write $Q$ when $G$ is understood from context. When $G$ is a $d$-regular
graph, and $P$ the natural random walk on $G$, the Laplacian $L(P)$ is related to $Q(G)$ via
$L(P) = \frac{1}{d} Q(G)$.

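$\langle \phi L, \psi \rangle_\pi$, for real-valued functions $\phi$, $\psi$ on $V$, and measure $\pi$ with support everywhere on $V$, denotes
the quadratic form based on $L$ under the inner product $\langle \phi, \psi \rangle_\pi = \sum_v \phi(v) \psi(v) \pi(v)$. That is,
$\langle \phi L, \psi \rangle_\pi = \sum_v (\phi L)(v) \psi(v) \pi(v)$. If $L$ is the Laplacian of a reversible ergodic chain, then $L$ is
self-adjoint in this inner-product space.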

$\lambda_1$ denotes the second-largest eigenvalue of the transition matrix for the chain under discussion.

$\lambda_*$ is the second largest among the absolute values of the eigenvalues of the chain under discussion.
This should not be confused with $|\lambda_1|$. The two sometimes, but not always, coincide.

$\mu_1$ denotes the smallest strictly positive eigenvalue of the Laplacian $L(P)$ for the chain under
discussion.

denotes  the  smallest  strictly  positive  eigenvalue  of  Q(G)  for  the  graph  under  discussion,  generally 
the  underlying  graph  of  a  reversible  chain. 

$\mathrm{Tr}[M]$ for a matrix $M$ denotes its trace, which is the sum of its diagonal entries, and this is equal
to the sum of its eigenvalues.

$\Sigma_{rc}$ is the set of all $m \times n$ nonnegative integer tables with row sums $r = (r_1, r_2, \ldots, r_m)$ and column
sums $c = (c_1, c_2, \ldots, c_n)$, where $\sum_i r_i = \sum_j c_j = N$.

$\mathrm{Binomial}(n, p)$ denotes the binomial distribution with parameters $n$ and $p$. The discrete random
variable $X$ is distributed $\mathrm{Binomial}(n, p)$ when $\Pr\{X = k\} = \binom{n}{k} p^k (1 - p)^{n-k}$. For further
background, see [Fel70, Vol. 1, Chapter VI].

Other  Conventions 

We use boldface to mark the definition of a new term. When there is little danger of confusion, we
omit commas between multiple subscripts. Thus $P_{ij}$ denotes $P_{i,j}$.

Lemmas, theorems, examples, and figures are numbered in a common sequence within chapters.
So Example 2.1 would precede Lemma 2.2, which would precede Theorem 2.4, which would precede
Figure 2.5. All would be found in Chapter 2. Numbered equations are numbered in a separate
sequence within chapters.

We use a slightly nonstandard definition of a graph. For us, a graph means a finite undirected
multigraph, with self-loops allowed. We call a graph simple if it is a graph in the usual sense,
without self-loops or multiple edges. Every graph $G$ is naturally associated with a nonnegative
integer symmetric matrix $A$, where $A_{ij}$ is the number of edges between the vertices $i$ and $j$. This
is the adjacency matrix of $G$. For a vertex $v$, its degree is defined as $\deg v = \sum_w A_{vw} = \sum_w A_{wv}$.

Note that this counts self-loop edges only once. This means that $|E| \le \sum_v \deg v \le 2|E|$, where
we consider $E$ as the multiset of edges. The latter is an equality precisely when the graph has
no self-loops. We call a graph regular if every vertex has the same degree. For $d$-regular graphs
without self-loops we have $d|V| = 2|E|$. We call a graph bipartite if it has no odd-length cycles. A
self-loop constitutes an odd-length cycle by itself.

In a graph $G = (V, E)$, if $A \subseteq V$, then

$$\mathrm{Nbd}(A) = \{\, w \mid w \notin A \text{ and } (\exists v \in A)[(v, w) \in E] \,\}.$$

This is the set of vertices that are neighbors of vertices in $A$ but are not themselves in $A$. Note that
a  vertex  appears  only  once  in  Nbd(A),  though  multiple  edges  from  A  may  reach  it. 
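
The graph conventions above can be made concrete with a small sketch. The following Python functions compute degrees, the number of edges, and Nbd(A) from an adjacency matrix in which a diagonal entry counts self-loops and an off-diagonal entry counts parallel edges. The function names and the three-vertex example are hypothetical and serve only to illustrate the conventions.

```python
def degrees(adj):
    """deg v = sum over w of A[v][w]; each self-loop at v is counted once."""
    return [sum(row) for row in adj]


def num_edges(adj):
    """Size of the edge multiset: self-loops plus edges between distinct vertices."""
    n = len(adj)
    loops = sum(adj[v][v] for v in range(n))
    non_loops = sum(adj[v][w] for v in range(n) for w in range(v + 1, n))
    return loops + non_loops


def nbd(adj, subset):
    """Vertices outside `subset` joined by at least one edge to a vertex in `subset`."""
    subset = set(subset)
    return {
        w
        for w in range(len(adj))
        if w not in subset and any(adj[v][w] > 0 for v in subset)
    }


if __name__ == "__main__":
    # Vertex 0 has one self-loop and two parallel edges to vertex 1;
    # vertex 1 has a single edge to vertex 2.
    adj = [
        [1, 2, 0],
        [2, 0, 1],
        [0, 1, 0],
    ]
    deg = degrees(adj)
    m = num_edges(adj)
    print(deg, m, m <= sum(deg) <= 2 * m)
    print(nbd(adj, {0}), nbd(adj, {0, 1}))
```

In this example the degree sum lies strictly between $|E|$ and $2|E|$ because the graph has a self-loop, and vertex 1 appears only once in Nbd({0}) even though two edges reach it.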


Contents

Abstract

Acknowledgements

Introduction

Notation and Conventions

1 Techniques for Bounding Convergence Rates
  1.1 Ergodic Markov Chains
  1.2 Time-Reversibility
  1.3 Random Walks on Graphs
  1.4 Convergence in Terms of the Spectrum
    1.4.1 Bounds for Reversible Chains
    1.4.2 Strong Aperiodicity
    1.4.3 Reversibilisation
  1.5 Geometric Eigenvalue Bounds
    1.5.1 The Laplacian
    1.5.2 Cheeger-Type Bounds
    1.5.3 Canonical Path Arguments
    1.5.4 Poincaré-type Bounds
  1.6 Probabilistic Bounds
    1.6.1 Couplings
    1.6.2 Strong Stationary Times

2 Urn Models and Group Actions
  2.1 Ehrenfest-type Models
    2.1.1 Arbitrary Steps
    2.1.2 Adjacent Moves
  2.2 Bernoulli-Laplace-type Models
    2.2.1 A Coupling for Bernoulli-Laplace Processes
    2.2.2 Tight Analysis of the Two-Urn Case
    2.2.3 Weak Analysis of the General Case
  2.3 Markov Chains based on Groups
    2.3.1 Transitive Group Actions
    2.3.2 Cayley Graphs
    2.3.3 Vertex-Transitive Graphs
    2.3.4 Chains Based on Groups
    2.3.5 On the Harmonic Analysis Approach
    2.3.6 Magnification Bounds
    2.3.7 Eigenvalue Bounds for the Chains
    2.3.8 Examples

3 Mean-Value Estimation
  3.1 Variance Bounds
  3.2 Sampling to Achieve Given Variance
  3.3 Indicator Functions
  3.4 Central Limit Theorem

4 Using Expanders in Estimation
  4.1 Preliminaries
  4.2 Outline of the Algorithm
  4.3 Sample Means from $G_n$
  4.4 Majorities from $G_n$
  4.5 Combining the Results
  4.6 The Implied Algorithm
  4.7 Discussion

5 Estimating the Significance of Contingency Tables
  5.1 A Random Walk on $\Sigma_{rc}$
  5.2 Bigger Steps
  5.3 Eigenvalue Hypothesis
  5.4 Experimental Results
    5.4.1 Background
    5.4.2 Pinckney Gag Rule
    5.4.3 Hair Color v. Eye Color
    5.4.4 Irregular Margins
    5.4.5 Scaling
  5.5 Provably Polynomial-Time Methods
  5.6 More on the Eigenvalue Hypothesis

6 Directions for Future Work

A Notions of Approximation
  A.1 Point Approximations
  A.2 Approximate Distributions
    A.2.1 Total Variation
    A.2.2 Separation and Relative Pointwise Distance
    A.2.3 Approximation within Ratio
    A.2.4 Kolmogorov-Smirnov Distance

B Sampling from Near-Stationary Chains
  B.1 Nearly Independent, Near-Stationary Samples
  B.2 Other Near-Stationary Samples
  B.3 On Near-Independence and the Median Lemma

C Enumerating Contingency Tables
  C.1 Classical Counting Approaches
    C.1.1 Exhaustive Enumeration
    C.1.2 A Recursive Formula
    C.1.3 Approximation Formulas
    C.1.4 Tables and Group Theory
    C.1.5 Counting with Generating Functions
  C.2 Hardness of a Related Problem
  C.3 Approximate Counting using Sampling

Bibliography


Chapter  1 

Techniques  for  Bounding 
Convergence  Rates 


In  this  chapter  we  summarise  techniques  for  bounding  the  rate  of  convergence  of  Markov  chains. 
These  techniques  are  applied  to  get  new  results  in  later  chapters.  This  chapter  is  largely  expository, 
and  is  not  intended  to  present  new  contributions  of  the  author,  although  it  contains  a  few  new 
arguments  in  some  examples. 

We  first  give  a  brief  summary  of  the  basic  ergodic  theory  of  Markov  chains,  and  in  particular  of 
random  walks  on  graphs  viewed  as  time-reversible  Markov  chains.  Readers  that  are  unfamiliar  with 
this  basic  theory  may  find  Karlin’s  text  [Kar68]  helpful.  His  appendix  covering  the  Perron-Frobenius 
theory  of  stochastic  matrices  is  essential  background. 

Then  we  show  how  upper  bounds  on  the  time  to  convergence  can  be  obtained  from  bounds  on 
the  eigenvalues  of  the  Markov  chain.  The  necessary  bounds  on  the  eigenvalues  are  obtained  by 
geometric means, involving “expansion” and related connectivity properties of the underlying graph
of  the  Markov  chain. 

Finally  we  discuss  some  probabilistic  techniques.  These  are  based  on  the  construction  of  certain 
stopping  time  random  variables  having  the  property  that  bounds  on  the  tails  of  their  distributions 
provide  bounds  on  the  variation  distance. 


1.1  Ergodic  Markov  Chains 

Let $V$ be a finite set of states and let $\{X_k \mid k \ge 0\}$ be a sequence of random variables taking values
in $V$ such that for each $k$ the following property holds:

$$\Pr\{X_k = u \mid X_0 = u_0, X_1 = u_1, \ldots, X_{k-1} = u_{k-1}\} = \Pr\{X_k = u \mid X_{k-1} = u_{k-1}\}. \tag{1.1}$$


A probability distribution $\pi$ on $V$ is called a stationary distribution of $P$ if it is a left unit
eigenvector of the matrix $P$, i.e., if $\pi P = \pi$. We can think of such a $\pi$ as a fixed point of $P$, and we
might then expect convergence of $\pi_k$ to such a stationary $\pi$ under certain conditions. In fact, the
following theorem confirms this intuition.

Theorem 1.1 (Basic Convergence Theorem) Let $P$ be an ergodic Markov chain on $V$. The
following two conditions are equivalent for any probability distribution $\pi$ on $V$:

1. $\pi$ is a stationary distribution of $P$: $\pi P = \pi$;

2. for every $v \in V$, $\lim_{k \to \infty} \pi_k(v) = \pi(v)$ regardless of the choice of $\pi_0$.

Furthermore, there exists a unique distribution $\pi$ satisfying these conditions. This distribution is
nonzero everywhere on $V$.

This classical theorem tells us that for an ergodic chain, no matter how we choose the initial
distribution,  eventually  the  distribution  of  the  state  approaches  a  unique  stationary  distribution  on 
the  state  space.  For  brevity,  we  say  simply  that  the  process,  rather  than  ‘the  distribution  of  the 
state  of  the  process,’  converges  to  its  stationary  distribution. 
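
To see the convergence theorem at work on a toy case, the following Python sketch iterates $\pi_k = \pi_{k-1} P$ from two different point-mass initial distributions. The three-state transition matrix and the function names are assumptions made up for this example; its stationary distribution $(1/4, 1/2, 1/4)$ can be checked by hand from $\pi P = \pi$.

```python
def step(dist, P):
    """One step of the chain acting on a row vector: returns dist * P."""
    n = len(P)
    return [sum(dist[u] * P[u][v] for u in range(n)) for v in range(n)]


def evolve(dist, P, k):
    """Return the k-th step distribution pi_k = pi_0 * P^k."""
    for _ in range(k):
        dist = step(dist, P)
    return dist


# A small ergodic chain (hypothetical example): a lazy walk on a path of 3 states.
P_EXAMPLE = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

if __name__ == "__main__":
    for start in ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]):
        print(start, "->", evolve(start, P_EXAMPLE, 60))
```

Both starting points lead to (numerically) the same limit. The sketch says nothing about how many steps are needed in general, which is the question taken up in the rest of the chapter.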

A Markov chain $P$ is called doubly stochastic if $P^T$, the transpose of the matrix $P$, is also a
stochastic matrix. This is the same as saying that not only the rows, but also each of the columns
sums to 1. A chain $P$ is called symmetric if $P = P^T$. Note that a symmetric chain is necessarily
doubly stochastic. The uniqueness of the stationary distribution gives the following corollary.

Corollary  1.2  If  P  is  an  ergodic  chain  on  S,  it  converges  to  the  uniform  distribution  on  S  if  and 
only  if  P  is  doubly  stochastic.  In  particular,  if  P  is  symmetric,  then  it  converges  to  the  uniform 
distribution  on  S. 
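
The corollary can be checked mechanically on small examples. The sketch below tests whether a matrix is doubly stochastic and whether the uniform distribution is fixed by it. The two example matrices and all function names are assumptions for illustration; the first matrix is doubly stochastic without being symmetric.

```python
def is_doubly_stochastic(P, tol=1e-12):
    """True if every row and every column of P sums to 1 (within tol)."""
    n = len(P)
    rows_ok = all(abs(sum(P[u]) - 1.0) <= tol for u in range(n))
    cols_ok = all(abs(sum(P[u][v] for u in range(n)) - 1.0) <= tol for v in range(n))
    return rows_ok and cols_ok


def apply_chain(dist, P):
    """Row vector times matrix: the distribution after one step."""
    n = len(P)
    return [sum(dist[u] * P[u][v] for u in range(n)) for v in range(n)]


def uniform_is_stationary(P, tol=1e-12):
    """True if the uniform distribution U satisfies U * P = U (within tol)."""
    n = len(P)
    u = [1.0 / n] * n
    return all(abs(a - b) <= tol for a, b in zip(apply_chain(u, P), u))


# Lazy walk around a directed 3-cycle: doubly stochastic but not symmetric.
P_CYCLE = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]

# Lazy walk on a path of 3 states: stochastic, but its columns do not sum to 1.
P_PATH = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

if __name__ == "__main__":
    for name, P in (("cycle", P_CYCLE), ("path", P_PATH)):
        print(name, is_doubly_stochastic(P), uniform_is_stationary(P))
```

For both matrices the two checks agree, as the corollary predicts for ergodic chains.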

The  theorems  of  this  section  do  not  tell  us  anything  about  how  fast  the  convergence  will  be.  We 
would like bounds on how far the $k$th-step distribution $\pi_k$ will be from the stationary distribution $\pi$.
We  spend  the  rest  of  this  chapter  describing  techniques  to  address  this  problem. 

1.2  Time-Reversibility 

Recall that the $k$th-step distribution of a Markov chain with transition matrix $P$ is given by
$\pi_k = \pi_0 P^k$, where $\pi_0$ is the initial distribution. In general, the action of a transition matrix
$P$ on the state distribution $\pi_k$ is hard to analyse. However, when the linear operator $P$ can be
converted to a matrix diagonalisable over an orthonormal basis of eigenvectors, the action of $P$ is
reasonably simple when viewed in this basis. In this section, we develop this idea.

If $P$ is an ergodic chain with stationary distribution $\pi$, the time reversal $P^R$ of $P$ is the chain
whose matrix entries are $P^R_{xy} = \pi(y) P_{yx} / \pi(x)$. Intuitively, if
the chain $P$ is supposed to be evolving in the stationary distribution $\pi$, then $P^R_{xy}$ gives the probability
that $P$ was in state $y$ at time $k - 1$ given that it is in state $x$ at time $k$. Thus when $P$ is evolving
in the stationary distribution, $P^R$ is the transition matrix for the process $P$ observed with time
reversed.

An ergodic Markov chain $P$ with stationary distribution $\pi$ is called time-reversible if $P = P^R$.
Equivalently, this means that for each pair of states $x$ and $y$,

$$\pi(x) P_{xy} = \pi(y) P_{yx}.$$

Many  chains  that  arise  from  physical  problems  are  time-reversible.  In  the  next  section  we  show 
that  the  random  walk  on  any  undirected  graph  is  time-reversible,  which  gives  rise  to  a  number  of 
combinatorially  interesting  chains. 
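
A short sketch may help fix the definitions of the time reversal and of time-reversibility. The Python code below builds $P^R$ from $P$ and $\pi$ using $P^R_{xy} = \pi(y) P_{yx} / \pi(x)$ and tests the condition $\pi(x) P_{xy} = \pi(y) P_{yx}$. The two example chains and the function names are assumptions chosen for illustration: the path walk is reversible, while the lazy walk around a directed cycle is not, and its reversal is simply its transpose because its stationary distribution is uniform.

```python
def time_reversal(P, pi):
    """Return P^R with entries P^R[x][y] = pi[y] * P[y][x] / pi[x]."""
    n = len(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]


def is_reversible(P, pi, tol=1e-12):
    """Check pi[x] * P[x][y] == pi[y] * P[y][x] for every pair of states."""
    n = len(P)
    return all(
        abs(pi[x] * P[x][y] - pi[y] * P[y][x]) <= tol
        for x in range(n)
        for y in range(n)
    )


# Lazy walk on a path of 3 states, with stationary distribution (1/4, 1/2, 1/4).
P_PATH = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
PI_PATH = [0.25, 0.50, 0.25]

# Lazy walk around a directed 3-cycle, with uniform stationary distribution.
P_CYCLE = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
]
PI_CYCLE = [1.0 / 3, 1.0 / 3, 1.0 / 3]

if __name__ == "__main__":
    print(is_reversible(P_PATH, PI_PATH), time_reversal(P_PATH, PI_PATH))
    print(is_reversible(P_CYCLE, PI_CYCLE), time_reversal(P_CYCLE, PI_CYCLE))
```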

Note that the condition that $P$ is both time-reversible and doubly stochastic (i.e., has uniform
stationary  distribution)  is  equivalent  to  the  condition  that  P  is  symmetric.  But  when  P  is  not 
doubly  stochastic,  time-reversibility  gives  a  more  general  symmetry  condition. 

The algebraic significance of time-reversibility is that it allows us to transform $P$ into a
diagonalizable form. Suppose $P$ is an ergodic time-reversible Markov chain on a state space $V$ of $n$ elements
with stationary distribution $\pi$, and let $R$ be the diagonal matrix with $x$th diagonal entry $\sqrt{\pi(x)}$.
Let $M = RPR^{-1}$. Then

\[ M_{xy} = \frac{\sqrt{\pi(x)}}{\sqrt{\pi(y)}} P_{xy} = \frac{\sqrt{\pi(y)}}{\sqrt{\pi(x)}} P_{yx} = M_{yx}, \]

so $M$ is symmetric. Thus we can write

\[ M = \Gamma B \Gamma^T, \]

where $\Gamma$ is the orthogonal matrix whose columns are the eigenvectors of $M$, and $B$ is the diagonal
matrix of the eigenvalues of $M$, all of which are real; note that these are also the eigenvalues of $P$.
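As a quick numerical illustration of the symmetrization above, the following sketch builds $M = RPR^{-1}$ for a small chain and checks that it is symmetric. The three-state chain, the function name `symmetrize`, and the tolerance are assumptions made for illustration; they do not appear in the text.

```python
from math import sqrt, isclose

def symmetrize(P, pi):
    """Return M = R P R^{-1}, where R = diag(sqrt(pi(x)))."""
    n = len(P)
    return [[sqrt(pi[x]) * P[x][y] / sqrt(pi[y]) for y in range(n)]
            for x in range(n)]

# Hypothetical reversible chain: a birth-death walk on three states.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
pi = [0.25, 0.50, 0.25]   # stationary distribution of P

M = symmetrize(P, pi)
is_symmetric = all(isclose(M[x][y], M[y][x], abs_tol=1e-12)
                   for x in range(3) for y in range(3))
```

Since $M_{xy} = M_{yx}$ holds exactly when $\pi(x)P_{xy} = \pi(y)P_{yx}$, the same check can serve as a test of time-reversibility with respect to a candidate $\pi$.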

The Perron-Frobenius theorem ensures that exactly one of these eigenvalues is $\lambda_0 = 1$, and that all
of the other eigenvalues have absolute value smaller than 1. We use $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{n-1} > -1$
to denote the remaining eigenvalues in decreasing order. We use $\lambda_*$ to denote the largest absolute
value of any of these non-unit eigenvalues:

\[ \lambda_* = \max\{ |\lambda_i| \;:\; 1 \le i \le n-1 \} = \max\{ |\lambda_1|, |\lambda_{n-1}| \}. \]

We call $\lambda_*$ the second absolute eigenvalue of $P$. This should not be confused with $|\lambda_1|$. They
sometimes, but not always, coincide.

The columns of $\Gamma$ give an orthonormal basis in which we can more easily treat $P$. Let $\Gamma_{\cdot x}$ denote
the column of $\Gamma$ corresponding to the state $x$. Then $B_{xx}$ is the corresponding eigenvalue of $P$.
The values $\lambda_i$, $0 \le i \le n-1$, are some permutation of the values $B_{xx}$, $x \in V$. Let $z \in V$ be the
state such that $B_{zz} = \lambda_0 = 1$. Note that the eigenvector $\Gamma_{\cdot z}$ of $M$ corresponding to $\lambda_0$ has entries
$\Gamma_{yz} = \sqrt{\pi(y)}$, where $\pi$ is the stationary distribution of $P$.

We  now  state  the  main  reason  for  working  with  time-reversible  Markov  chains. 




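Theorem 1.3 (Spectral Representation) Let $P$ be a time-reversible ergodic Markov chain on $V$
with stationary distribution $\pi$. Then for all $x, y \in V$, we have

\[ P^k_{xy} = \pi(y) + \sqrt{\frac{\pi(y)}{\pi(x)}} \sum_{w \in V,\, w \ne z} B_{ww}^k \, \Gamma_{xw} \Gamma_{yw}. \]

Proof: Since $M = \Gamma B \Gamma^T$ and $\Gamma^{-1} = \Gamma^T$ by orthogonality, we have $M^k = \Gamma B^k \Gamma^T$; hence

\[ M^k_{xy} = \sum_{w \in V} \Gamma_{xw} B_{ww}^k \Gamma_{yw}. \]

We also know that $P^k = R^{-1} M^k R$, that is,

\[ P^k_{xy} = \sqrt{\frac{\pi(y)}{\pi(x)}} \, M^k_{xy}. \]

Combine the two. Since $B_{zz} = \lambda_0 = 1$, and $\Gamma_{xz} = \sqrt{\pi(x)}$ for all $x$, we find that the term corresponding
to $w = z$ gives exactly $\pi(y)$. The other terms appear in the sum. ■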
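The spectral representation can be checked by hand on a two-state chain, where the eigenvectors of $M$ are known in closed form. The sketch below is only an illustration under stated assumptions: the parameters `a` and `b`, the helper names, and the choice $k = 7$ are hypothetical, and the eigenvectors used are $(\sqrt{\pi(0)}, \sqrt{\pi(1)})$ for eigenvalue 1 and $(\sqrt{\pi(1)}, -\sqrt{\pi(0)})$ for eigenvalue $1 - a - b$.

```python
from math import sqrt

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, k):
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, P)
    return result

# Hypothetical two-state chain: 0 -> 1 with probability a, 1 -> 0 with probability b.
a, b = 0.3, 0.1
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]          # stationary distribution
lam1 = 1 - a - b                         # the only non-unit eigenvalue
gamma1 = [sqrt(pi[1]), -sqrt(pi[0])]     # unit eigenvector of M for lam1

def spectral_entry(x, y, k):
    """Right-hand side of the spectral representation for the two-state chain."""
    return pi[y] + sqrt(pi[y] / pi[x]) * lam1 ** k * gamma1[x] * gamma1[y]

k = 7
Pk = mat_pow(P, k)
max_err = max(abs(Pk[x][y] - spectral_entry(x, y, k))
              for x in range(2) for y in range(2))
```

Here `max_err` measures the largest disagreement between the matrix power and the spectral formula; it should be at the level of floating-point rounding.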


1.3  Random  Walks  on  Graphs 


Most  of  the  results  in  this  thesis  deal  directly  with  and  apply  generally  to  time-reversible  Markov 
chains.  However,  many  of  the  chains  that  arise  as  examples  can  be  viewed  as  random  walks  on 
graphs.  Here  we  describe  the  connection. 

Let $G = (V, E)$ be a graph. (Our definition is nonstandard; see Notations and Conventions,
p.  xfF).  The  natural  random  walk  on  G  is  the  Markov  chain  with  state  space  V  determined  by 
the following process. If the current state is a given vertex $v$, an edge $\{v, w\}$ is chosen uniformly at
random amongst the $\deg v$ possibilities. This chosen edge is then traversed, and the next state is
the neighbor $w$ along that edge. The transition matrix for this process is given by:

\[ P_{vw} = \begin{cases} \dfrac{1}{\deg v} & \text{if } (v, w) \in E, \\ 0 & \text{otherwise.} \end{cases} \]


Theorem 1.4 If $G = (V, E)$ is a connected graph that is not bipartite, the natural random walk
process $P$ on $G$ converges to the stationary distribution

\[ \pi(v) = \frac{\deg v}{\sum_u \deg u}. \]


Proof: Because $G$ is connected, there is a path of possible transitions between every two states; thus
$P$ is irreducible. Provided $G$ is not bipartite, there exist both odd-length and even-length sequences
of possible transitions from any state $x$ back to $x$. This ensures $P$ is aperiodic. Since $P$ is both
irreducible and aperiodic it is ergodic. One can easily check that the function $\pi(v) = \deg v / \sum_u \deg u$
is indeed a stationary distribution of the walk:

\[ [\pi P](w) = \sum_{v : (v,w) \in E} \frac{1}{\deg v}\, \pi(v)
            = \sum_{v : (v,w) \in E} \frac{1}{\deg v} \cdot \frac{\deg v}{\sum_u \deg u}
            = \frac{\deg w}{\sum_u \deg u}
            = \pi(w). \]

By Theorem 1.1, we know that this is the unique stationary distribution and that the distribution
of the position of the walk will tend to this distribution. ■


Note what this says informally: in the long term, the proportion of time that the walk spends
on vertex $v$ is proportional to the degree of the vertex.

Corollary 1.5 Suppose $G$ is any connected graph that is not bipartite. Then the stationary
distribution of the random walk on $G$ is the uniform distribution on $V$ if and only if $G$ is regular.

Theorem 1.6 If $G$ is a connected non-bipartite graph, the natural random walk on $G$ is ergodic and
time-reversible.


Proof: Let $P$ denote the walk process on $G = (V, E)$. By Theorem 1.4, the stationary distribution
of $P$ on $V$ is $\pi(v) = \deg v / \sum_u \deg u$, and by definition, $P_{vw} = 1/\deg v$. Now note that, since $(v, w)$ is
an edge only if $(w, v)$ is:

\[ \pi(v) P_{vw} = \frac{\deg v}{\sum_u \deg u} \cdot \frac{1}{\deg v} = \frac{1}{\sum_u \deg u} = \pi(w) P_{wv}. \]
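Both Theorem 1.4 and the reversibility computation above are easy to confirm on a small example with exact arithmetic. The sketch below assumes a simple graph given as an adjacency dictionary; the graph (a triangle with one pendant vertex) and the helper names are hypothetical choices, not taken from the text.

```python
from fractions import Fraction

def walk_matrix(adj):
    """Natural random walk on a simple undirected graph {vertex: [neighbours]}."""
    P = {v: {w: Fraction(0) for w in adj} for v in adj}
    for v, nbrs in adj.items():
        for w in nbrs:
            P[v][w] += Fraction(1, len(nbrs))
    return P

def degree_distribution(adj):
    """pi(v) = deg v / sum of all degrees."""
    total = sum(len(nbrs) for nbrs in adj.values())
    return {v: Fraction(len(nbrs), total) for v, nbrs in adj.items()}

# Hypothetical graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
# It is connected and contains an odd cycle, so it is not bipartite.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
P = walk_matrix(adj)
pi = degree_distribution(adj)

# Stationarity: [pi P](w) = pi(w) for every w.
pi_next = {w: sum(pi[v] * P[v][w] for v in adj) for w in adj}
is_stationary = (pi_next == pi)

# Time-reversibility: pi(v) P_vw = pi(w) P_wv for every pair.
is_reversible = all(pi[v] * P[v][w] == pi[w] * P[w][v]
                    for v in adj for w in adj)
```

Using `Fraction` keeps the comparison exact, so no floating-point tolerance is needed.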


There is a much weaker, but useful, relation in the converse. Suppose $P$ is ergodic and
time-reversible on $V$. There is an associated undirected graph on $V$ such that every transition of $P$
corresponds to a step along an edge of the graph. Suppose $\vec{E}$ denotes the set of possible directed
transitions $e = (x, y)$ with $P_{xy} > 0$. These come in pairs, since $(x, y) \in \vec{E}$ iff $(y, x) \in \vec{E}$ follows
from reversibility and ergodicity. Let $E$ denote the set of possible transitions, $\{x, y\}$, $x \ne y$, taken
without orientation. We can define an undirected graph $G_P = (V, E)$, sometimes called the
underlying graph of the chain. This graph is simple; it has no self-loops or multiple edges. If $P$ is
the random walk on a graph $G$, then $G_P$ is obtained from $G$ by removing self-loops and collapsing
multiple edges to single ones.

This  underlying  graph  will  play  a  large  role  in  the  geometric  eigenvalue  bounds  presented  later 
in  this  chapter. 




1.4  Convergence  in  Terms  of  the  Spectrum 

In  this  section  we  discuss  bounds  on  the  convergence  rate  based  on  the  eigenvalues  of  the  transition 
matrix.  Later  we  discuss  methods  of  bounding  these  eigenvalues. 

In order to speak precisely about the convergence rate, we first need to introduce distance
measures between distributions. We will generally use one of the following two notions of distance
between distributions on $V$. The total variation $\|\pi_k - \pi\|$ is defined by

\[ \|\pi_k - \pi\| = \max_{A \subseteq V} |\pi_k(A) - \pi(A)|. \tag{1.2} \]

It is commonly used in statistical settings. Another notion of distance is handy for its nice
combinatorial properties. For $r > 0$, we say that the distribution $\pi_k$ approximates $\pi$ within ratio
$r$ if for all points $v \in V$, $\frac{1}{r}\pi_k(v) \le \pi(v) \le r\,\pi_k(v)$. The latter is closely related to the relative
pointwise distance

\[ \mathrm{rpd}(\pi_k, \pi) = \max_{v \in V} \frac{|\pi_k(v) - \pi(v)|}{\pi(v)} \tag{1.3} \]

used  by  Jerrum  and  Sinclair  in  their  work  on  this  topic  [SJ87]  [JS88]  [JS90].  Further  important 
background  material  appears  in  Appendix  A. 
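The distance notions just introduced translate directly into code. The sketch below is illustrative only: the function names and the two example distributions are assumptions. It uses the standard fact that the maximum in (1.2) equals half the $\ell_1$ distance, and includes a brute-force version over all subsets as a cross-check.

```python
from itertools import combinations

def total_variation(p, q):
    """max over subsets A of |p(A) - q(A)|, computed as half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def total_variation_bruteforce(p, q):
    """Same quantity, by enumerating every subset A (small state spaces only)."""
    n = len(p)
    best = 0.0
    for size in range(n + 1):
        for A in combinations(range(n), size):
            best = max(best, abs(sum(p[i] - q[i] for i in A)))
    return best

def rpd(p, q):
    """Relative pointwise distance of p from the reference distribution q."""
    return max(abs(a - b) / b for a, b in zip(p, q))

def approximates_within_ratio(p, q, r):
    """True if p(v)/r <= q(v) <= r*p(v) for every point v."""
    return all(a / r <= b <= r * a for a, b in zip(p, q))

# Hypothetical example: p plays the role of pi_k, q the role of pi.
p = [0.50, 0.30, 0.20]
q = [0.25, 0.50, 0.25]
tv = total_variation(p, q)
tv_check = total_variation_bruteforce(p, q)
rel = rpd(p, q)
```

In this example the total variation is at most half the relative pointwise distance, the general relationship that follows from summing $|\pi_k(v) - \pi(v)| \le \mathrm{rpd} \cdot \pi(v)$ over $v$.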


1.4.1  Bounds  for  Reversible  Chains 

We  use  a  lemma  to  prove  our  main  spectral  convergence  bound. 

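Lemma 1.7 (Adapted from [SJ87].) Let $P$ be a time-reversible ergodic Markov chain with
stationary distribution $\pi$ and second absolute eigenvalue $\lambda_*$. For any initial distribution $\pi_0$ on $V$ we
have

\[ \mathrm{rpd}(\pi_k, \pi) \le \frac{\lambda_*^k}{\pi_{\min}}, \]

where $\pi_{\min} = \min_{v \in V} \pi(v)$.

Proof: By the Spectral Representation Theorem 1.3, we have

\[ \pi_k(y) = \sum_x \pi_0(x) P^k_{xy} = \pi(y) + \sum_x \pi_0(x) \sqrt{\frac{\pi(y)}{\pi(x)}} \sum_{w \in V,\, w \ne z} B_{ww}^k \Gamma_{xw} \Gamma_{yw}. \]

So that

\[ \frac{|\pi_k(y) - \pi(y)|}{\pi(y)}
   = \left| \sum_x \frac{\pi_0(x)}{\sqrt{\pi(x)\pi(y)}} \sum_{w \ne z} B_{ww}^k \Gamma_{xw} \Gamma_{yw} \right|
   \le \frac{\max_x \left| \sum_{w \ne z} B_{ww}^k \Gamma_{xw} \Gamma_{yw} \right|}{\pi_{\min}}
   \le \frac{\lambda_*^k}{\pi_{\min}}. \]

Here the first inequality is obtained by extracting the maximum term and using the fact that
$\sum_x \pi_0(x) = 1$ in the numerator. The denominator is bounded trivially. Then the second inequality is
obtained by bounding $|B_{ww}^k|$ by $\lambda_*^k$, and using the fact that $\sum_w |\Gamma_{xw} \Gamma_{yw}| \le 1$, by the orthogonality
of $\Gamma$. ■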


The following theorem puts these in a convenient form, summarizing the $\lambda_*$-based convergence
bounds.

For the chain $P$, with stationary distribution $\pi$ and second absolute eigenvalue $\lambda_*$, define the
function

\[ T_P(\epsilon) = \frac{1}{1 - \lambda_*} \ln \frac{2}{\epsilon\, \pi_{\min}}. \tag{1.4} \]

Theorem 1.8 For any initial distribution $\pi_0$, the following holds for any positive $\epsilon < 1$. Whenever

\[ k \ge T_P(\epsilon) \]

we have

\[ \mathrm{rpd}(\pi_k, \pi) \le \frac{\epsilon}{2}, \]

and

\[ \|\pi_k - \pi\| \le \frac{\epsilon}{4}, \]

and

$\pi_k$ approximates $\pi$ within ratio $1 + \epsilon$.

Proof: The relative pointwise distance bound here follows from the preceding lemma, by expressing
the bound there in terms of the exponential, and using the Taylor expansion for the logarithm.
The relative pointwise distance bound implies the variation distance bound by Proposition A.2 in
Appendix A. The bound for approximation within ratio follows from the bound on relative pointwise
distance and the fact that approximation to within relative error $\epsilon/2$ implies approximation within
ratio $1 + \epsilon$. (See Appendix A.) ■

In general we will call a function $f(\epsilon)$ a convergence guarantee for the chain $P$ if for every
initial distribution $\pi_0$, taking $k \ge f(\epsilon)$ ensures $\mathrm{rpd}(\pi_k, \pi) \le \epsilon/2$ (which implies also that the bounds
on variation distance and approximation within ratio given in the preceding theorem hold). Thus,
Theorem 1.8 states that $T_P(\epsilon)$ is a convergence guarantee for $P$, but there may be better guarantees
available.
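To make the notion of a convergence guarantee concrete, the sketch below evaluates a guarantee of the form $\frac{1}{1-\lambda_*}\ln\frac{2}{\epsilon\,\pi_{\min}}$ and confirms that the bound $\lambda_*^k/\pi_{\min}$ of Lemma 1.7 has dropped to $\epsilon/2$ by that time. The exact form of the function is an assumption here (it is what the lemma yields via $\ln(1/\lambda_*) \ge 1 - \lambda_*$), and the numerical values of `lam_star`, `pi_min`, and `eps` are hypothetical.

```python
from math import ceil, log

def convergence_guarantee(lam_star, pi_min, eps):
    """Assumed form of T_P(eps): (1 / (1 - lam_star)) * ln(2 / (eps * pi_min))."""
    return log(2.0 / (eps * pi_min)) / (1.0 - lam_star)

# Hypothetical chain parameters.
lam_star, pi_min, eps = 0.9, 0.01, 0.1

k = ceil(convergence_guarantee(lam_star, pi_min, eps))
rpd_bound = lam_star ** k / pi_min   # Lemma 1.7 bound on rpd after k steps
```

The guarantee grows only logarithmically in $1/\pi_{\min}$ and $1/\epsilon$, but linearly in $1/(1 - \lambda_*)$, which is why the remainder of the chapter concentrates on bounding the spectral gap.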

In some cases we may be able to determine the full spectrum of $P$. When this happens, and the
chain $P$ is not only time-reversible, but symmetric, we can get the following bound utilizing the full
spectrum of $P$. This version often gives provably tight answers.

Theorem 1.9 (Symmetric Case, Full Spectrum Bound) [AD86] If $P$ is symmetric, then for
every initial distribution $\pi_0$,

\[ \|\pi_k - \pi\|^2 \le \frac{1}{4} \left( \mathrm{Tr}[P^{2k}] - 1 \right). \]




These  will  be  our  basic  tools  for  proving  upper  bounds  on  convergence  rates.  However,  these 
theorems  alone  provide  no  clue  of  how  to  bound  the  spectrum,  a  problem  that  in  general  is  hard  for 
large  state  spaces.  We  will  see  techniques  for  bounding  the  spectrum  in  the  next  section. 

1.4.2  Strong  Aperiodicity 

The necessity of bounding the second largest eigenvalue in absolute value poses a minor technical
problem, because most of the techniques available for bounding the spectrum only give us information
about $\lambda_1$, rather than $\lambda_*$. The following theorems provide a simple method of converting a Markov
chain to a time-reversible one with nonnegative eigenvalues, and the same stationary distribution.

A Markov chain $P$ is called strongly aperiodic or diagonally dominant if $P_{xx} \ge \frac{1}{2}$ for each $x$,
that is, if at each state the probability of immediate return is at least 1/2.

Theorem 1.10 If $P$ is irreducible and strongly aperiodic then all of its eigenvalues are nonnegative,
so that then, in particular, $\lambda_1 = \lambda_*$.

Proof: Consider the matrix $M = 2P - I$. This is a stochastic matrix that is irreducible (not
necessarily aperiodic). Applying the Perron-Frobenius theorem, the eigenvalues of $M$ satisfy
$-1 \le u_i \le 1$. The eigenvalues $\lambda_i$ of $P$ are related to the eigenvalues $u_i$ of $M$ by $\lambda_i = \frac{1}{2}(u_i + 1) \ge 0$. ■


One  can  get  a  strongly  aperiodic  process  from  any  Markov  process  P  as  follows.  The  proof  is 
obvious. 

Theorem 1.11 If $P$ is an irreducible chain then $P' = \frac{1}{2}(I + P)$ is ergodic and strongly aperiodic,
and has the same stationary distribution as $P$. If $P$ is time-reversible, then so is $P'$. If $P$ is doubly
stochastic, then so is $P'$. If $P$ is symmetric, then so is $P'$.

For these reasons, we will sometimes speak of the strongly aperiodic form of a chain $P$, by
which we mean the chain $P' = \frac{1}{2}(I + P)$. We can imagine this chain as the following process. At
each step flip a fair coin. If the coin comes up 'heads' make a move according to $P$, otherwise remain
in the current state.
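The coin-flipping description can be written down directly. The following sketch is an illustration under assumptions: `lazy` and `lazy_step` are hypothetical names, and the example chain (the deterministic walk on two states, which is irreducible but periodic) is chosen only because it shows clearly what the construction repairs.

```python
import random

def lazy(P):
    """Strongly aperiodic form P' = (I + P) / 2 of a transition matrix P."""
    n = len(P)
    return [[0.5 * ((1.0 if x == y else 0.0) + P[x][y]) for y in range(n)]
            for x in range(n)]

def lazy_step(state, P, rng):
    """One step of P': flip a fair coin, move by P on heads, stay on tails."""
    if rng.random() < 0.5:            # tails: remain in the current state
        return state
    return rng.choices(range(len(P)), weights=P[state])[0]

# Hypothetical example: the periodic two-state chain that always switches.
P = [[0.0, 1.0],
     [1.0, 0.0]]
P_lazy = lazy(P)

rng = random.Random(0)                # fixed seed so the run is repeatable
state, visits = 0, [0, 0]
for _ in range(1000):
    state = lazy_step(state, P, rng)
    visits[state] += 1
```

Simulating `lazy_step` never requires forming `P_lazy` explicitly, which matters when the state space is too large to write down.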

1.4.3  Reversibilization 

Similarly, we can also make a reversible chain from any ergodic one.

Lemma 1.12 If $P$ is an ergodic chain, with stationary distribution $\pi$, then the stationary
distribution of $P^R$ is also $\pi$.




Proof: Simply verify that $\pi P^R = \pi$. For each $y$, we have

\[ [\pi P^R](y) = \sum_x \pi(x) P^R_{xy}
              = \pi(y) \sum_x P_{yx}
              = \pi(y), \]

where the last equality holds because $P$ is stochastic. ■


Theorem 1.13 Let $P$ be an ergodic chain with stationary distribution $\pi$. Let $P^R$ be the
time-reversal of $P$, and define $P' = \frac{1}{2}(P + P^R)$. Then $P'$ is ergodic, time-reversible, and also has the stationary
distribution $\pi$. Furthermore, if $P$ is strongly aperiodic, then so is $P'$.


Proof: It is clear that since $P$ is ergodic, so is $P'$. From the lemma above, $\pi$ is the stationary
distribution of $P^R$, thus also of $P'$. Now to show that $P'$ is time-reversible, we need only show that
$(P')^R = P'$. It is easy to see that the $(x, y)$ entries are equal:

\[ \frac{\pi(y)}{\pi(x)} P'_{yx}
   = \frac{1}{2} \frac{\pi(y)}{\pi(x)} \left[ P_{yx} + \frac{\pi(x)}{\pi(y)} P_{xy} \right]
   = \frac{1}{2} \left[ \frac{\pi(y)}{\pi(x)} P_{yx} + P_{xy} \right]
   = P'_{xy}. \]

Finally note that $P_{xx} = P^R_{xx}$. Thus $P'$ will be strongly aperiodic exactly when $P$ is strongly aperiodic. ■
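The additive reversibilization is straightforward to compute when $\pi$ is known. The sketch below uses exact rational arithmetic on a hypothetical non-reversible chain (a biased walk on a 3-cycle with holding probability 1/3); the function names are assumptions made for illustration.

```python
from fractions import Fraction

def time_reversal(P, pi):
    """Entries P^R_{xy} = pi(y) P_{yx} / pi(x)."""
    n = len(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]

def additive_reversibilization(P, pi):
    """P' = (P + P^R) / 2."""
    PR = time_reversal(P, pi)
    n = len(P)
    return [[(P[x][y] + PR[x][y]) / 2 for y in range(n)] for x in range(n)]

def is_reversible(P, pi):
    n = len(P)
    return all(pi[x] * P[x][y] == pi[y] * P[y][x]
               for x in range(n) for y in range(n))

# Hypothetical chain on a 3-cycle: stay w.p. 1/3, clockwise 1/2, counter-clockwise 1/6.
n = 3
P = [[Fraction(0)] * n for _ in range(n)]
for x in range(n):
    P[x][x] = Fraction(1, 3)
    P[x][(x + 1) % n] = Fraction(1, 2)
    P[x][(x - 1) % n] = Fraction(1, 6)
pi = [Fraction(1, n)] * n             # P is doubly stochastic, so pi is uniform

P_sym = additive_reversibilization(P, pi)
pi_after = [sum(pi[x] * P_sym[x][y] for x in range(n)) for y in range(n)]
```

In this example `P` fails the detailed balance test while `P_sym` passes it, and the diagonal entries are unchanged, in line with the last remark of the proof.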


Fill [Fil90] gives some useful theorems relating convergence rate bounds of the non-reversible
chain to its reversible version. He also discusses a different multiplicative form of reversibilization,
given by $PP^R$. This multiplicative reversibilization, however, does not always preserve ergodicity.

The  following  example  indicates  that  reversibilization  can  lose  accuracy,  but  that  there  may  not 
be  any  natural  useful  spectral  convergence  bounds  in  the  absence  of  reversibility. 

Example 1.14 The de Bruijn graph of order $n$ is the directed graph with $N = 2^n$ vertices labelled
by the integers modulo $2^n$, and having a directed edge from the vertex labelled $x$ to that labelled
$y$ if $y = 2x$ or $y = 2x + 1$, each taken modulo $2^n$. This can be viewed as left-shifting the binary
representation of $x$, discarding the left-most bit, and placing either a 0 or a 1 in the right-most
coordinate. The graph is directed, and the natural random walk on this graph is not time-reversible.
It is, however, ergodic. It is not hard to see that one bit gets 'randomized' with each step, so that
the position of a random walk on this graph is exactly uniform after exactly $n$ steps. After $n - 1$
steps the distribution is supported on only half of the vertices, so the total variation is large. The
$n$th step yields exactly the uniform distribution. The non-unit eigenvalues of the transition
matrix are all zero, precluding any bound directly involving multiplicative factors like $\lambda_1^k$ or $\lambda_*^k$.

The convergence of the reversible chain $P' = \frac{1}{2}(P + P^R)$, on the other hand, can be related to
a symmetric random walk on a line segment of length $n$, and the time required for convergence of
$P'$ can be shown to be $\Theta(n^2)$. Thus, making the chain reversible in this way can cause a significant
loss in the rate to stationarity.

Using the multiplicative reversibilization here does not seem to help; the chain $PP^R$ is not even
ergodic. □
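The abrupt convergence of the de Bruijn walk described in this example can be observed exactly by propagating a point mass. The sketch below is illustrative: the order $n = 4$, the starting vertex 5, and the function names are arbitrary assumptions.

```python
from fractions import Fraction

def de_bruijn_step(dist, n):
    """One step of the walk: x moves to 2x or 2x + 1 (mod 2^n), each w.p. 1/2."""
    N = 2 ** n
    new = [Fraction(0)] * N
    for x, mass in enumerate(dist):
        if mass:
            new[(2 * x) % N] += mass / 2
            new[(2 * x + 1) % N] += mass / 2
    return new

def tv_to_uniform(dist):
    """Total variation distance from the uniform distribution."""
    u = Fraction(1, len(dist))
    return sum(abs(p - u) for p in dist) / 2

n = 4
N = 2 ** n
dist = [Fraction(0)] * N
dist[5] = Fraction(1)                 # point mass at an arbitrary start vertex

history = []                          # total variation after 1, 2, ..., n steps
for _ in range(n):
    dist = de_bruijn_step(dist, n)
    history.append(tv_to_uniform(dist))
```

After $k \le n$ steps the mass is spread evenly over $2^k$ vertices, so the distance is $1 - 2^{k-n}$: it is still 1/2 one step before the end and drops to 0 on the $n$th step.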


1.5  Geometric  Eigenvalue  Bounds 

Some  of  the  better-known  ways  to  get  bounds  on  the  eigenvalues  of  the  transition  matrix  fail  in  the 
contexts  that  interest  us.  Methods  involving  direct  computation  with  the  matrix  (see  [GL89])  can 
give  accurate  approximations  to  the  eigenvalues  for  small  matrices  and  when  the  matrix  can  be  given 
explicitly,  but  are  not  generally  useful  when  we  wish  to  prove  bounds  for  whole  classes  of  chains, 
where  the  matrix  is  only  implicitly  known  or  specified,  and  when  the  state  space  is  exponentially 
large  in  the  natural  parameters.  Likewise,  bounds  such  as  those  of  the  Gershgorin  type  and  those 
described  in  [CDS80]  are  typically  unusable  for  the  large  sparse  matrices  that  we  encounter.  They 
tend to give results like $\lambda_* \le 1$, which we know in any case by the Perron-Frobenius theorem.

In  this  section  we  present  some  bounds  that  do  seem  to  be  useful  for  the  type  of  problems  that 
we  wish  to  consider.  These  techniques  are  essentially  geometric.  They  give  bounds  in  terms  of 
geometric  properties  of  the  underlying  graph  of  the  chain. 

Later,  in  Chapter  2,  we  will  briefly  discuss  harmonic  analysis,  which  is  a  different  method  that 
has  been  used  successfully  to  get  bounds  on  the  eigenvalues  of  certain  Markov  chains  associated  with 
groups. 


1.5.1 The Laplacian

Let $P$ be a reversible ergodic Markov chain on a finite state space $V$ with stationary distribution
$\pi$. We associate to $P$ a Laplacian $L = I - P$, where $I$ is the identity matrix. We will view
this as an operator on the functions $\phi : V \to \mathbb{R}$, which we can also view as (row) vectors in $\mathbb{R}^{|V|}$.
(Viewing such a function $\phi$ as a vector, the action is then given by $\phi L$.) We impose an inner-product
$\langle \phi, \psi \rangle_\pi = \sum_v \phi(v)\psi(v)\pi(v)$. For example, if $\mathbf{1} = (1, 1, \ldots, 1)$ then $\langle \phi, \mathbf{1} \rangle_\pi$ is the mean-value of
$\phi$ under the distribution $\pi$.

Since $P$ is ergodic, we have $\phi L = 0$, the zero function, if and only if $\phi = c\pi$ for some $c$. The
operator $L$ has $|V|$ real eigenvalues $\mu_i$, $0 \le i < |V|$, related to those of $P$ by $\mu_i = 1 - \lambda_i$. Thus
taking the eigenvalues in increasing order and in multiplicity, we have

\[ \mu_0 = 0 < \mu_1 \le \mu_2 \le \cdots \le \mu_{|V|-1} < 2. \]

Lower bounds on $\mu_1$ can be translated to upper bounds on $\lambda_1$ of $P$ via

\[ 1 - \mu_1 = \lambda_1. \]

In this section, we will discuss bounds on $\mu_1$. It should be understood that these can then be used
directly to bound the convergence rate by applying the theorems of the previous section.

Let $\vec{E}$ denote the set of possible directed transitions $e = (x, y)$ with $P_{xy} > 0$. Recall that the
directed transitions come in pairs; $(x, y) \in \vec{E}$ iff $(y, x) \in \vec{E}$ from reversibility and ergodicity. Let $E$,
again, denote the set of possible transitions, $\{x, y\}$, $x \ne y$, taken without orientation. Recall that
$G_P = (V, E)$ is called the underlying graph of $P$ and is simple.

Define $F(x, y) = \pi(x) P_{xy} = \pi(y) P_{yx}$. Note $F(x, y)$ is nonzero only when $(x, y)$ is a possible
transition. It is a symmetric matrix, by reversibility. For $x \ne y$, we may thus write $F(e)$, for $e \in E$,
without ambiguity. It can be thought of as the "flow" of probability mass over the edge $e$ when the
chain is evolving in the stationary distribution.

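For $e = (x, y) \in \vec{E}$, define

\[ \nabla\phi(e) = \phi(y) - \phi(x). \]

For $\phi, \psi : V \to \mathbb{R}$, define the Dirichlet form

\[ \langle \phi L, \psi \rangle_\pi = \sum_v (\phi L)(v)\,\psi(v)\,\pi(v) = \frac{1}{2} \sum_{x, y} (\phi(y) - \phi(x))(\psi(y) - \psi(x)) F(x, y). \]

This is the quadratic form on $L$ in the space of real-valued functions on $V$ with inner product
$\langle \phi, \psi \rangle_\pi$. One can see that the form is symmetric ($L$ is self-adjoint in this space),
and as we already knew, positive semi-definite. Using our notation, we can write

\[ \langle \phi L, \phi \rangle_\pi = \frac{1}{2} \sum_{e \in \vec{E}} (\nabla\phi(e))^2 F(e) = \sum_{e \in E} (\nabla\phi(e))^2 F(e). \tag{1.5} \]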

In the latter equality, the $\frac{1}{2}$ factor and $\vec{E}$ have been replaced with the set $E$ of undirected transitions.
In making this replacement we have used the fact that the undirected edge $\{x, y\}$ is represented twice
by directed edges, once by $(x, y)$ and again by $(y, x)$, each contributing the same value of the term
$(\nabla\phi(e))^2 F(e)$, and also that $\nabla\phi(x, y) = 0$ for $x = y$. For the undirected edges in the resulting form,
these terms may be computed with either orientation assigned to the edge; the squaring and the
symmetry of $F(e)$ make the choice irrelevant.
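The identity between the operator form and the edge-sum form of the Dirichlet form can be verified exactly on a small reversible chain. The sketch below is an illustration with stated assumptions: it applies $L$ to $\phi$ as an operator on functions, $(L\phi)(v) = \phi(v) - \sum_w P_{vw}\phi(w)$, which is the convention under which the identity holds when $\pi$ is not uniform; the chain, the test function `phi`, and all names are hypothetical.

```python
from fractions import Fraction as Fr

def dirichlet_via_operator(P, pi, phi):
    """sum_v (L phi)(v) phi(v) pi(v), with (L phi)(v) = phi(v) - sum_w P[v][w] phi(w)."""
    n = len(P)
    L_phi = [phi[v] - sum(P[v][w] * phi[w] for w in range(n)) for v in range(n)]
    return sum(L_phi[v] * phi[v] * pi[v] for v in range(n))

def dirichlet_via_edges(P, pi, phi):
    """sum over undirected edges {x, y} of (phi(y) - phi(x))^2 F(x, y)."""
    n = len(P)
    return sum((phi[y] - phi[x]) ** 2 * pi[x] * P[x][y]
               for x in range(n) for y in range(x + 1, n))

# Hypothetical reversible chain (birth-death walk on three states).
P = [[Fr(1, 2), Fr(1, 2), Fr(0)],
     [Fr(1, 4), Fr(1, 2), Fr(1, 4)],
     [Fr(0), Fr(1, 2), Fr(1, 2)]]
pi = [Fr(1, 4), Fr(1, 2), Fr(1, 4)]
phi = [Fr(1), Fr(-2), Fr(3)]

form_operator = dirichlet_via_operator(P, pi, phi)
form_edges = dirichlet_via_edges(P, pi, phi)
```

The edge-sum form makes positive semi-definiteness evident, since every term is a square times a nonnegative flow.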




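Rayleigh's principle gives

\[ \mu_1 = \inf \left\{ \frac{\langle \phi L, \phi \rangle_\pi}{\langle \phi, \phi \rangle_\pi} \;\middle|\; \langle \phi, \mathbf{1} \rangle_\pi = 0,\ \phi \ne 0 \right\} \tag{1.6} \]

\[ \phantom{\mu_1} = \inf \left\{ \frac{\sum_{e \in E} (\nabla\phi(e))^2 F(e)}{\sum_v \phi(v)^2 \pi(v)} \;\middle|\; \langle \phi, \mathbf{1} \rangle_\pi = 0,\ \phi \ne 0 \right\}, \tag{1.7} \]

the infimum being attained precisely at any eigenfunction of $\mu_1$.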

This is a weighted form of the Rayleigh quotient of the graphical Laplacian $Q(G_P)$ for the
underlying graph $G_P$. The graphical Laplacian of a graph $G$ is $Q(G) = D - A$ where $D$ is the
diagonal matrix of degrees, $D_{vv} = \deg v$, and $A$ is the adjacency matrix of the graph. This is
symmetric and positive semi-definite, with smallest eigenvalue 0, and smallest positive eigenvalue $\nu_1$
given by

\[ \nu_1 = \inf \left\{ \frac{\langle \phi Q, \phi \rangle}{\langle \phi, \phi \rangle} \;\middle|\; \langle \phi, \mathbf{1} \rangle = 0,\ \phi \ne 0 \right\}, \tag{1.8} \]

where $\langle \phi, \phi \rangle$ is the usual $\ell^2$ inner product.

This fact can be used to translate known lower bounds for $\nu_1$ to lower bounds for $\mu_1$. In particular
the following property is immediate.


Theorem 1.15 If $P$ is an ergodic symmetric chain on $V$ then

\[ p\,\nu_1 \le \mu_1 \le p'\,\nu_1 \]

where

\[ p = \min_{\{x,y\} \in E} P_{xy} \quad \text{and} \quad p' = \max_{\{x,y\} \in E} P_{xy} \]

are respectively the minimum and maximum nonzero probability of any transition $(x, y)$, $x \ne y$.

Proof: The stationary distribution is uniform on $V$; $\pi(v) = \frac{1}{|V|}$. Thus the inner-product $\langle \phi, \psi \rangle_\pi$
is the usual $\ell^2$ inner-product scaled by the multiplicative factor $1/|V|$. Also $F(x, y) = P_{xy}/|V|$ must lie
between $p/|V|$ and $p'/|V|$. Plugging these into the quotient of (1.7), bringing out $p$ in the numerator and
cancelling $1/|V|$ from both numerator and denominator gives the result. ■

Remark: The relationship of the eigenvalues of $Q(G)$ and connectivity properties of the graph
has been known and studied for some time. The eigenvalue $\nu_1$ was called "algebraic connectivity"
by Fiedler [Fie73] who showed some relationships to the standard notion of edge-connectivity. The
relation between this eigenvalue and expansion properties in graphs has been investigated by Alon
and co-authors [Alo86] [AM85] [AGM87] as well as numerous others. These are the Cheeger-type
bounds of the next section. The relation of the smallest positive eigenvalue to isoperimetric quantities
in the continuous realm was known earlier [Che70]. The Poincare-type bounds we will see later seem
also to have earlier analogues in continuous cases. There seems to be a wealth of deep and interesting
questions to pursue in trying to find other useful connections to known results in the continuous
case.

The operators $L(P)$ and $Q(G)$ are discrete analogues of the continuous Laplacian operator.
Their  Rayleigh  quotients  in  (1.7)  and  (1.8)  are  the  natural  discrete  analogues  of  that  arising  in  the 
“membrane”  problem  under  “free”  (Neumann)  boundary  conditions,  with  sums  replacing  integrals. 
The  Laplacian  figures  prominently  in  the  diffusion  theory  of  heat,  sound,  and  fluids.  Here  is  some 
useful  intuition:  A  random  walk  on  a  regular  graph  starts  out  as  a  point  mass  at  a  vertex  of  the 
graph,  and  the  transition  matrix  P  (or  alternatively  the  Laplacian  L)  serves  as  the  diffusion  operator, 
smoothing  the  measure  to  uniformity.  This  is  analogous  to  the  situation  of  a  free  membrane  that  is 
“poked”  at  a  point,  causing  a  wave-like  spreading  vibration  which  eventually  settles  to  uniformity. 
(This  idea  can  be  exploited  to  give  interesting  graphic  displays  of  the  evolution  of  random  walks  on 
subgraphs  of  the  plane  grid.) 

The matrix $Q(G)$ also appears in a well-known determinant formula for the number of spanning
trees of the graph. $Q(G)$ is also equal to the product $CC^T$ where $C$ is the $|V| \times |E|$ incidence matrix
of the vertices to edges of any directed orientation of $G$. (See [Big74, page 35] and [Knu68, Exercises
2.3.4.2.18-20]) □


1.5.2 Cheeger-Type Bounds

There is a naturally motivated relationship between the expansion properties of a graph, and the
eigenvalues of the Laplacian. Cheeger [Che70] proved a lower bound on the smallest positive
eigenvalue for the Laplacian on Riemannian manifolds. Here we give two discrete versions of that theorem
which hold for reversible Markov chains.

As before, let $P$ be a reversible ergodic chain on $V$ with stationary distribution $\pi$. If $S$ is any
subset of states, let $\bar{S} = V - S$ and

\[ C(S) = \{ e \in E \mid e = \{x, y\},\ x \in S,\ y \in \bar{S} \} \]

denote the set of edges crossing the "cut" between $S$ and $\bar{S}$ and

\[ F(C(S)) = \sum_{e \in C(S)} F(e). \]

Define

\[ h \equiv \min_{S : \pi(S) \le 1/2} \frac{F(C(S))}{\pi(S)}. \]

The notation $h$ follows Cheeger, but following Sinclair and Jerrum, who proved the next theorem,
we call this the conductance of $P$. Intuitively, this quantity is a measure of the ability of the chain
to admit "flow" $F(C(S))$ of probability mass out of any set $S$, adjusted by the weight of $S$ in the
stationary distribution. If $h$ is high, there are no "bottlenecks" in the flow, and one expects that
convergence will then be rapid. The following theorem confirms this intuition.
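For very small chains the conductance can be computed by brute force, enumerating every subset $S$ with $\pi(S) \le 1/2$. The sketch below is only an illustration (the enumeration is exponential in $|V|$); the function name and the example, the natural random walk on a 5-cycle, are assumptions.

```python
from fractions import Fraction
from itertools import combinations

def conductance(P, pi):
    """Minimum over nonempty S with pi(S) <= 1/2 of F(C(S)) / pi(S), by enumeration."""
    n = len(P)
    best = None
    for size in range(1, n):
        for S in combinations(range(n), size):
            weight = sum(pi[x] for x in S)
            if weight > Fraction(1, 2):
                continue
            inside = set(S)
            # Flow of stationary probability mass across the cut (S, complement of S).
            flow = sum(pi[x] * P[x][y]
                       for x in S for y in range(n) if y not in inside)
            ratio = flow / weight
            if best is None or ratio < best:
                best = ratio
    return best

# Hypothetical example: natural random walk on a cycle with 5 vertices.
n = 5
P = [[Fraction(1, 2) if (y - x) % n in (1, n - 1) else Fraction(0)
      for y in range(n)] for x in range(n)]
pi = [Fraction(1, n)] * n

h = conductance(P, pi)
```

On the 5-cycle the minimizing set is a pair of adjacent vertices: two edges cross the cut, giving $2/(2 \cdot 2) = 1/2$.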




In the case that $P$ is the random walk on a $d$-regular graph, note that the quantity $h$ reduces to

\[ h = \min_{S : |S| \le |V|/2} \frac{|C(S)|}{d\,|S|}. \]

Theorem 1.16 (Conductance Bound) [SJ87]
Let $P$ be an ergodic time-reversible Markov chain, let $L$ be the Laplacian for $P$, and let $\mu_1$ be the
smallest positive eigenvalue of $L$. If $h$ is the conductance of $P$ defined above, then $\mu_1$ satisfies

\[ \frac{h^2}{2} \le \mu_1 \le 2h. \]

There is another version, based on a 'vertex' expansion quantity called magnification, which
we denote $c$. In some cases it will give better bounds if we can avoid converting to the edge-based
version when a magnification bound is known.

For a given set of vertices $S \subset V$, let $\mathrm{Nbd}(S)$ denote the set of nodes that are neighbors of nodes
in $S$, but are not in $S$:

\[ \mathrm{Nbd}(S) = \{ y \mid y \in \bar{S},\ (\exists x \in S)[(x, y) \in E] \}. \]

\[ c = \min_{S : |S| \le |V|/2} \frac{|\mathrm{Nbd}(S)|}{|S|}. \]

Call a graph that has magnification $c$ a $c$-magnifier.
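The magnification of a small graph can likewise be found by enumeration. The following sketch assumes a simple graph given as an adjacency dictionary; the function name and the 5-cycle example are hypothetical and serve only to illustrate the definition.

```python
from fractions import Fraction
from itertools import combinations

def magnification(adj):
    """Minimum over nonempty S with |S| <= |V|/2 of |Nbd(S)| / |S|, by enumeration."""
    verts = sorted(adj)
    best = None
    for size in range(1, len(verts) // 2 + 1):
        for S in combinations(verts, size):
            inside = set(S)
            # Neighbours of vertices in S that are not themselves in S.
            nbd = {w for v in S for w in adj[v]} - inside
            ratio = Fraction(len(nbd), size)
            if best is None or ratio < best:
                best = ratio
    return best

# Hypothetical example: the cycle on 5 vertices.
cycle5 = {v: [(v - 1) % 5, (v + 1) % 5] for v in range(5)}
c = magnification(cycle5)
```

For the 5-cycle the minimum is attained by two adjacent vertices, which have exactly two outside neighbours, so the cycle is a 1-magnifier.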


Theorem 1.17 (Magnification Bound) [Alo86] Let $P$ be an ergodic symmetric chain with
underlying graph $G_P$, and let $p = \min_{\{x,y\} \in E} P_{xy}$ and $p' = \max_{\{x,y\} \in E} P_{xy}$. Let $L$ be the Laplacian of
$P$, and $\mu_1$ be its smallest positive eigenvalue. If $G_P$ is a $c$-magnifier then $\mu_1$ satisfies


2(1 -c)‘ 


Proof: Apply Theorem 1.15 together with the lower and upper bounds for $\nu_1(Q(G_P))$
provided by Lemma 2.4 and Theorem 2.5 of Alon [Alo86]. ■


A family of graphs $\mathcal{G} = \{G_n\}$ where $|G_n| = n$, is a family of $(d, c)$-magnifiers if for all $n$, $G_n$
is $d$-regular ($d$ constant in $n$) and has magnification at least $c$ (again constant in $n$). It is an immediate
consequence of the preceding theorem that a family of $d$-regular graphs is a magnifying family
precisely when $\nu_1(G_n)$ is bounded away from 0 as $n$ increases (equivalently if the same holds for $\mu_1$
for the natural random walk on $G_n$). By Theorem 1.8, the random walk on $G_n$ has a convergence
guarantee that is $O(\ln n + \ln(1/\epsilon))$. It is easy to see that, as a function of $n$, this is the best possible;
no family of $d$-regular graphs (with $d$ fixed) can have a convergence guarantee that is $o(\ln n)$, as $n$
grows (taking any fixed $\epsilon < 1$). For, if the walk starts on a vertex $u_0$ in $G_n$, so $\pi_0$ is a point-mass
on $u_0$, and if $k = f(n) = o(\log_d n)$ steps are taken, then the distribution $\pi_k$ is supported on at most
$d^k = o(n)$ vertices. Since $G_n$ is regular, the stationary distribution is uniform, thus the variation
distance $\|\pi_k - \pi\|$ will approach 1 as $n$ grows.




For large $n$, almost all $d$-regular graphs of size $n$ are good magnifiers [Alo86], so for most large
$d$-regular graphs $G$, the random walk on $G$ converges rapidly.

Although we generally know that we are not dealing with a magnifying family, bounds on the
quantities $h$ and $c$ will still yield bounds on the convergence rate. Bounding $h$ or $c$ means solving
what is called an "isoperimetric problem," the graphical version of: "How much volume can one
enclose with given perimeter?" Equivalently, one can ask how large a perimeter is necessary to
enclose a given volume. The Greeks knew that in the plane, the circle had smallest perimeter for
given enclosed area, but probably had no proof. Similar results hold in Euclidean space of higher
dimensions [Ban80]. The problem becomes more interesting when the question is asked in a bounded
domain, where the boundary of the domain is not counted in the perimeter of a region. This is the
sort of problem we have, except that here the domain is a graph, volume is the number of nodes,
and the perimeter of a region is the number of edges or nodes at the region's boundary.

Remark: The conductance version was first proved by Sinclair and Jerrum [SJ87]. Diaconis and
Stroock [DS8S] give a simpler proof. The vertex-based version of Cheeger's inequality was proved
earlier by Alon [Alo86] using the Max-Flow/Min-Cut theorem. Alon's goals were in the reverse
direction, to give lower bounds on expansion properties using knowledge of the eigenvalue. □


1.5.3  Canonical  Path  Arguments 

Canonical  path  arguments  give  a  technique  for  lower-bounding  the  Cheeger  quantity  h.  Systems  of 
paths  are  also  used  in  the  Poincare-type  bounds  presented  later. 

Let P be a reversible ergodic chain. For each x and y choose some canonical path $\gamma_{xy}$ from x
to y. Call this system of paths $\Gamma$. Define the (weighted) covering number of $\Gamma$

\[ \eta(\Gamma) = \max_{e} \frac{1}{F(e)} \sum_{\gamma_{xy} \ni e} \pi(x)\pi(y), \tag{1.9} \]

where the maximum is over all possible transitions e of the chain, and the sum is over paths $\gamma_{xy}$
that contain e. This is a measure of how heavily the system of paths uses any one edge.


Theorem 1.18 (Conductance in Terms of Paths) If P has a system $\Gamma$ of canonical paths with
$\eta = \eta(\Gamma)$, then the conductance h of P satisfies:

\[ h \ge \frac{1}{2\eta}. \]

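Proof: Let $S \subset V$ with $\pi(S) \le \frac{1}{2}$. Then

\begin{align*}
\pi(S)\pi(\bar S) &= \sum_{x \in S,\, y \in \bar S} \pi(x)\pi(y) \\
&\le \sum_{x \in S,\, y \in \bar S} \; \sum_{e \in \gamma_{xy} \cap C(S)} \frac{F(e)}{F(e)}\, \pi(x)\pi(y) \tag{1.10} \\
&= \sum_{e \in C(S)} F(e)\, \frac{1}{F(e)} \sum_{x \in S,\, y \in \bar S,\, \gamma_{xy} \ni e} \pi(x)\pi(y) \tag{1.11} \\
&\le \eta \sum_{e \in C(S)} F(e) \tag{1.12} \\
&= \eta\, F(C(S)). \tag{1.13}
\end{align*}

In line 1.10 we have used the fact that for any $x \in S$ and $y \in \bar S$, the path $\gamma_{xy}$ must cross the cut,
and so contain an edge in C(S). This ensures we are multiplying by a factor that is at least 1. Now
using the fact that $\pi(S) \le \frac{1}{2}$, so that $\pi(\bar S) \ge \frac{1}{2}$,

\[ \frac{F(C(S))}{\pi(S)} \ge \frac{\pi(\bar S)}{\eta} \ge \frac{1}{2\eta}. \]

□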


The theorem has an easy combinatorial interpretation in the case that P is a random walk on
a regular graph. To illustrate, let G be a d-regular connected non-bipartite graph, and let P be a
random walk on G. Recall that then

\[ h = \min_{S :\, |S| \le |V|/2} \frac{|C(S)|}{d\,|S|}. \]

The problem of lower-bounding h reduces to that of giving a lower bound for the ratio |C(S)|/|S|.
Suppose that we have a system of paths $\Gamma$ such that no edge appears more than b times (total over
all paths). We call b the unweighted covering number of $\Gamma$.

Now consider an arbitrary subset of vertices $S \subset V$. There are exactly $|S| \times |\bar S|$ paths that
go from $S$ to $\bar S$. Since no edge e from $S$ to $\bar S$ appears more than b times over all these paths, there
must be at least $|S||\bar S|/b$ edges that cross from $S$ to $\bar S$. Thus $|C(S)| \ge |S||\bar S|/b$. Combining this with
the supposition that $|S| \le |V|/2$ (hence $|\bar S| \ge |V|/2$), we get

\[ h \ge \frac{|V|}{2db}. \]

If we calculate $\eta$ for this chain, we get $\eta = db/|V|$, and thus the same $h \ge |V|/(2db)$. For general graphs,
we have the following.

Theorem 1.19 If P is the natural random walk on a connected non-bipartite graph G with maximum
degree d, and G admits a system of canonical paths in which no edge appears more than b times, then

\[ h \ge \frac{\sum_{v} \deg v}{2 d^2 b}. \]


Proof: Since the stationary distribution is

\[ \pi(v) = \frac{\deg v}{\sum_{u} \deg u}, \]

we have for every $e \in E$,

\[ F(e) = \frac{1}{\sum_{u} \deg u}. \]

Plugging in the bounds of the hypothesis gives

\[ \eta \le \frac{d^2 b}{\sum_{u} \deg u}, \]

yielding the desired bound. □


Plugging this into Theorem 1.16 gives the following.


Corollary 1.20 With the hypothesis of the preceding theorem,

\[ \mu_1 \ge \frac{1}{2} \left( \frac{\sum_{v} \deg v}{2 d^2 b} \right)^2. \]


So if we can define a system of canonical paths where no edge appears in too many paths, we will
get a small b, yielding a large Cheeger value h, yielding a good lower bound on $\mu_1$ for the Laplacian.
Any system of canonical paths is sufficient to give an immediate bound; see Theorem 1.27. We give
two simple examples here.


Example 1.21 (Line) Let G be the graph on the set V = [n] where there is an edge joining every
pair of vertices whose absolute difference is 1, and a self-loop edge on each endpoint. We call this the
n-segment. The self-loop at each endpoint makes the graph regular and the walk aperiodic. There
is a unique shortest path between every pair of vertices x and y. Choose $\gamma_{xy}$ to be this path. Consider
any edge e = {x, x + 1}. In a given direction, without loss of generality say left-to-right, e is traversed
by those paths $\gamma_{vw}$ for which $v \le x$ and $w \ge x + 1$. That is, the directed edge (x, x + 1) is contained
in x(n - x) paths. This is at most $n^2/4$, so no edge is contained in more than $b = n^2/4$ paths. We
conclude that $h \ge 1/n$. Moreover, for even n, cutting the line in the middle so that S = [n/2], we
get just this value of h. So this is an optimal conductance bound. This gives $\mu_1 \ge 1/(2n^2)$, which
is correct within a constant factor, the actual value being $1 - \cos(\pi/n) \approx \pi^2/(2n^2)$. Using this, for the
strongly aperiodic walk we get a bound of $\lambda_* \le 1 - 1/n^2$. Thus $T_P(\epsilon) = n^2 \ln(2n/\epsilon)$ is a convergence
guarantee for the strongly aperiodic walk. This is too large by about a $\ln n$ factor. The error is
due to the fact that the bound is based only on the second-largest eigenvalue. A coupling argument
(later) shows $O(n^2)$ convergence, and an $O(n^2)$ bound using the full spectrum is given in Chapter 2.
□
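
The counting in Example 1.21 is small enough to check by brute force. The following Python sketch (the function names are ours and purely illustrative) enumerates the left-to-right shortest paths on the n-segment, tallies how many of them use each directed edge (x, x + 1), and computes the conductance by exhaustive search over cuts, taking d = 2 and counting the undirected edges that cross the cut.

```python
from fractions import Fraction
from itertools import combinations

def segment_covering_number(n):
    """Largest number of left-to-right shortest paths through an edge (x, x+1)."""
    counts = {}
    for v in range(1, n + 1):
        for w in range(v + 1, n + 1):
            # The shortest path from v to w uses the edges (x, x+1) for v <= x < w.
            for x in range(v, w):
                counts[x] = counts.get(x, 0) + 1
    return max(counts.values())

def segment_conductance(n):
    """Brute-force h = min |C(S)| / (d |S|) over nonempty S with |S| <= n/2, d = 2."""
    best = None
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(1, n + 1), size):
            members = set(subset)
            # An edge {x, x+1} is in the cut when exactly one endpoint lies in S.
            cut = sum(1 for x in range(1, n)
                      if (x in members) != ((x + 1) in members))
            ratio = Fraction(cut, 2 * size)
            if best is None or ratio < best:
                best = ratio
    return best
```

For n = 8 the functions return b = 16 = n²/4 and h = 1/8 = 1/n, in agreement with the example. The exhaustive search is exponential in n and is meant only as a sanity check on small segments.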

Example 1.22 (Hypercube) Let $V = \{0, 1\}^d$, and let G be the graph of the d-dimensional
hypercube, with a self-loop edge at each vertex. This is the graph on V with an edge between every
two binary d-tuples that differ in at most one coordinate. For any two vertices x and y define the
canonical path from x to y as that path which brings x into agreement with y by “correcting” each
coordinate from left to right in the binary representation. Thus the canonical path from (0, 1, 1, 0)
to (1, 0, 1, 1) in the 4-dimensional case is: (0, 1, 1, 0) to (1, 1, 1, 0) to (1, 0, 1, 0) to (1, 0, 1, 1).

No directed edge e is used by more than $2^{d-1}$ paths. To see this, first note that any edge e in
the path from $x = (x_1, x_2, \ldots, x_d)$ to $y = (y_1, y_2, \ldots, y_d)$ joins two points of the form

\[ (y_1, \ldots, y_{i-1}, x_i, x_{i+1}, \ldots, x_d) \quad\text{and}\quad (y_1, \ldots, y_{i-1}, y_i, x_{i+1}, \ldots, x_d). \]

The ith coordinate is the one “corrected” by this step of the path. Now if we were given the
edge e and the binary (d - 1)-tuple $(x_1, \ldots, x_{i-1}, y_{i+1}, \ldots, y_d)$ we could determine all of
the coordinates of x and of y, and hence we could reconstruct the whole path $\gamma_{xy}$. Thus we can
encode the set of paths through e by elements of $\{0, 1\}^{d-1}$. More formally, there is an injection from
the set of paths $\gamma_{xy}$ containing e into $\{e\} \times \{0, 1\}^{d-1}$. It follows that no more than $2^{d-1}$ paths use a
given directed edge e. Consequently, $h \ge 1/(d + 1)$. This gives a bound of

\[ \mu_1 \ge \frac{1}{2(d+1)^2}, \]

which implies that

\[ T_P(\epsilon) = 4(d+1)^3 \ln(2/\epsilon) \]

is a convergence guarantee for the strongly aperiodic walk. This is an order of $d^2/\ln d$ from the
“right” answer, as we will discover later by obtaining the complete spectrum.

Although the eigenvalue bound is not tight, the lower bound on the conductance h obtained by
this argument is optimal; there are cuts $S \subset V$ attaining this value of h. For example, take S to be
the subcube of points with a 0 in the first coordinate. It should be noted that an argument based
on c (magnification) does not do asymptotically better. (The relevant minimal cut for determining
c for d even is a Hamming ball centered at (0, 0, ..., 0) and containing half the vertices, and this
also gives the same $\mu_1 = \Omega(1/d^2)$ bound. For proofs see [Bol86, Chapters 5 and 16].) □

There is a second idea hidden in the last example. It is a method of bounding the breadth b, for a
given system of canonical paths, and this aspect of the argument is due to Jerrum and Sinclair [JS88].
Suppose we can “encode” the set of paths that use any edge e by elements of a set $B_e$ so that given
e, any particular path $\gamma_{xy}$ that contains e can be reconstructed given the additional information of
an element from $B_e$. Then at most $b = \max_e |B_e|$ paths use a given edge. In the previous example, paths
containing a given edge were encoded using the (d - 1)-tuple, and $B_e = \{0, 1\}^{d-1}$, $|B_e| = 2^{d-1}$ for
all e.
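
The encoding argument can be confirmed by direct enumeration for small d. The Python sketch below (illustrative names, not taken from the thesis) builds the left-to-right bit-fixing path between every ordered pair of vertices of the d-cube and records how many paths use each directed edge. Self-loops never appear on a canonical path, so they are omitted.

```python
from itertools import product

def canonical_path(x, y):
    """Directed edges of the left-to-right bit-fixing path from x to y."""
    edges, cur = [], list(x)
    for i in range(len(x)):
        if cur[i] != y[i]:
            nxt = cur.copy()
            nxt[i] = y[i]              # "correct" coordinate i
            edges.append((tuple(cur), tuple(nxt)))
            cur = nxt
    return edges

def max_edge_load(d):
    """Largest number of canonical paths through any directed edge of the d-cube."""
    load = {}
    vertices = list(product((0, 1), repeat=d))
    for x in vertices:
        for y in vertices:
            for e in canonical_path(x, y):
                load[e] = load.get(e, 0) + 1
    return max(load.values())
```

Calling `canonical_path((0, 1, 1, 0), (1, 0, 1, 1))` reproduces the three steps listed in Example 1.22, and `max_edge_load(d)` returns $2^{d-1}$ for the small values of d we tried, matching the bound on b.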

1.5.4  Poincare-type  Bounds

Poincare-type  bounds  are  another  type  of  bound  based  on  canonical  paths.  These  also  rely  on  the 
careful  choice  of  a  system  of  canonical  paths  having  small  covering  number. 




For a system of canonical paths $\Gamma$ for P, define the quantity

\[ K(\Gamma) = \max_{e} \frac{1}{F(e)} \sum_{\gamma_{xy} \ni e} |\gamma_{xy}|\, \pi(x)\pi(y). \tag{1.14} \]

Here again, the maximum is over transitions e, the sum is over paths containing e, and $|\gamma_{xy}|$ denotes
the length of the path.

Theorem 1.23 (Poincare Bound) [DS89] Let P be a reversible ergodic chain, and $\Gamma$ a system of
canonical paths for P. Then

\[ \mu_1 \ge \frac{1}{K(\Gamma)}. \]

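Proof: We will lower-bound the Rayleigh quotient in 1.7. Using the fact that $\langle \phi, 1 \rangle_\pi = 0$, write

\[ \langle \phi, \phi \rangle_\pi = \frac{1}{2} \sum_{x,y} (\phi(x) - \phi(y))^2\, \pi(x)\pi(y). \]

Now write the difference $(\phi(x) - \phi(y))$ as a telescoping sum along the path $\gamma_{xy}$:

\[ \phi(x) - \phi(y) = \sum_{e \in \gamma_{xy}} \nabla\phi(e). \]

Combining these we proceed

\begin{align*}
\langle \phi, \phi \rangle_\pi &= \frac{1}{2} \sum_{x,y} \Bigl( \sum_{e \in \gamma_{xy}} \nabla\phi(e) \Bigr)^2 \pi(x)\pi(y) \\
&\le \frac{1}{2} \sum_{x,y} |\gamma_{xy}| \sum_{e \in \gamma_{xy}} (\nabla\phi(e))^2\, \pi(x)\pi(y) \tag{1.15} \\
&= \frac{1}{2} \sum_{e \in E} (\nabla\phi(e))^2 \sum_{\gamma_{xy} \ni e} |\gamma_{xy}|\, \pi(x)\pi(y) \\
&\le K(\Gamma)\, \frac{1}{2} \sum_{e \in E} (\nabla\phi(e))^2 F(e) \\
&= K(\Gamma)\, \langle \phi, L\phi \rangle_\pi,
\end{align*}

and thus the quotient in 1.7 is bounded below by $1/K(\Gamma)$, giving the stated result. The inequality
in 1.15 is Cauchy-Schwarz. □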


The following is a simple corollary, obtained by noting that if the longest path $\gamma_{xy} \in \Gamma$ has
length $\ell$, then $K(\Gamma)$ is related to the weighted covering number by $K(\Gamma) \le \ell\,\eta(\Gamma)$.

Corollary 1.24 Let P be a reversible ergodic chain admitting a system $\Gamma$ of canonical paths each
of length at most $\ell$, and let $\eta = \eta(\Gamma)$. Then

\[ \mu_1 \ge \frac{1}{\ell\,\eta}. \]


This  yields  the  following  result  in  the  graphs  realm. 


Corollary 1.25 Let G = (V, E) be a connected non-bipartite graph with maximum degree d, admitting
a system of canonical paths such that no edge appears more than b times over all paths, and
such that each path has length at most $\ell$. Then the random walk on G has

\[ \mu_1 \ge \frac{\sum_{v} \deg v}{\ell\, d^2 b}. \]


Proof: Using the same reasoning as in Theorem 1.19 to bound $\eta$, apply the previous corollary. □


Remark: The ideas behind the discrete Poincare bounds are due to Diaconis and Stroock [DS89].
The proof above is translated into our language. Unlike in Jerrum and Sinclair’s Cheeger-type
inequality, the paths here enter the proof in a fundamental way; a key step is to write $\phi(x) - \phi(y)$ as
a telescoping sum of the gradient along the path from x to y. These bounds often give better results
than the Cheeger-type bounds, based on the same system of canonical paths. Note that there is no
squaring involved here, as arises in the Cheeger bound. Using a given system of canonical paths,
and the versions of the theorems involving $\eta$, the Poincare-type bound will give a better result than
the Cheeger-type bound if and only if $\ell < 8\eta$, and in practice, this seems almost always to be the
case. However, it may happen that one has a bound on c or h without knowing a good system of
canonical paths, so that a Cheeger-type bound may be available, though no Poincare-type bound is
evident. This is the case with some of our later bounds for the eigenvalues of Cayley graphs. □


Example 1.26 (Hypercube) Take, again, the hypercube with self-loops. Using the same system
of canonical paths as in Example 1.22, we have $b = 2^{d-1}$, $\ell = d$, and the graph is (d + 1)-regular.
So the Poincare bound for the walk on the hypercube gives $\mu_1 \ge 2/(d(d+1))$. This is better than the
conductance bound by a constant factor, but is still of the same order. It is still a factor of d from
the right value $\mu_1 = 2/(d+1)$. We should note that the Poincare bound can be used in another way to
get a bound of the form $\mu_1 = \Omega(1/d)$. Project the walk on the hypercube to the line of length d
by recording the distance from the origin. Given the distance from the starting point, the walk on
the hypercube is uniformly distributed on the points at that distance. Carefully calculating with
the resulting non-uniform edge probabilities it is possible to get a bound on $\mu_1$ that is accurate to
within a constant factor. See [DS89]. □
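
To compare the two families of bounds numerically, the short Python sketch below evaluates them in exact arithmetic. It assumes the bounds read $h \ge \sum_v \deg v/(2d^2 b)$ (Theorem 1.19), $\mu_1 \ge h^2/2$ (Corollary 1.20) and $\mu_1 \ge \sum_v \deg v/(\ell d^2 b)$ (Corollary 1.25), which is how the surrounding proofs indicate they should be read; the function names are ours.

```python
from fractions import Fraction

def path_bounds(total_degree, d, b, ell):
    """Return (h lower bound, Cheeger bound on mu_1, Poincare bound on mu_1)."""
    h = Fraction(total_degree, 2 * d * d * b)            # assumed form of Theorem 1.19
    cheeger = h * h / 2                                  # assumed form of Corollary 1.20
    poincare = Fraction(total_degree, ell * d * d * b)   # assumed form of Corollary 1.25
    return h, cheeger, poincare

def hypercube_bounds(dim):
    """Bounds for the dim-cube with one self-loop per vertex and bit-fixing paths."""
    degree = dim + 1                     # dim cube edges plus the self-loop
    total_degree = degree * 2 ** dim
    return path_bounds(total_degree, degree, 2 ** (dim - 1), dim)
```

For dim = 4 this gives h ≥ 1/5, a Cheeger-type bound of 1/50 and a Poincare-type bound of 1/10, that is, 1/(d + 1), 1/(2(d + 1)²) and 2/(d(d + 1)) respectively.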

The ideas of the previous sections can be combined to give a general bound, which is of limited
interest to us here, but can be applied to such problems as covering times and universal traversal
sequences. (See [BK88], [AKL+79].)

Theorem 1.27 (General Bounds) Let G be a connected non-bipartite graph with maximum
degree d and diameter $\Delta$. Let P be the natural random walk on G. Then the smallest positive eigenvalue
$\mu_1$ of L = I - P satisfies

\[ \mu_1 \ge \max\left\{ \frac{\sum_{v} \deg v}{d^2 \Delta |V|^2},\; \frac{1}{2}\left( \frac{\sum_{v} \deg v}{2 d^2 |V|^2} \right)^2 \right\}. \]

Proof: Use any system of shortest paths as the canonical paths. Then $\ell = \Delta$, and $b \le |V|^2$. Apply
the Cheeger and Poincare bounds. □


1.6  Probabilistic  Bounds 

We have been talking exclusively about spectral bounds on the convergence rate. There are also
some purely probabilistic techniques for bounding the convergence rate. These techniques are
attractive because they involve only elementary and ‘algorithmic’ reasoning, and give a more intuitive
description of time to stationarity. However, they also have drawbacks. First, they seem to require
a certain cleverness or the use of rather special structure in order to yield good bounds. Second, in
our applications it is often helpful to know the spectrum, and it is not known how to recover this
information accurately from the probabilistic bounds.

Let P be a Markov chain on a finite state space S. Let $S^\infty$ denote the set of all infinite sequences
from the state space S. For an element $\alpha \in S^\infty$, let $\alpha[1:k]$ denote the initial segment consisting of
the first k elements of the sequence. A stopping time T for P is a function $T : S^\infty \to \mathbb{N} \cup \{\infty\}$,
such that if $T(\alpha) = k$ and $\alpha'[1:k] = \alpha[1:k]$ then $T(\alpha') = k$ as well. In other words, if T assigns
a value k to a given sequence of states $\alpha$ then it does so to all sequences sharing the same initial
segment up to k. Intuitively, T determines a value k that only depends on observing the sequence
up to time k. We regard T as a statistic on $S^\infty$ under the infinite product measure given by the
$\pi_k$’s, and we may then speak of the distribution of the random variable T.

The techniques discussed in the following two sections are based on the construction of stopping
time random variables T with the property that the variation distance $\|\pi_k - \pi\|$ of the kth-step
distribution of the walk from the stationary distribution is bounded by $\Pr\{T > k\}$, the mass in the
“tail” of the distribution of T after k.

1.6.1  Couplings 

Let Z be an ergodic Markov chain with state space V, and transition matrix P. Let $\pi_0$ be its initial
state distribution, let $\pi_k$ denote the distribution of its state at time k (i.e., $\pi_k = \pi_0 P^k$), and let $\pi$
be its unique stationary distribution.

A (Markovian) coupling (X, Y) for Z is a Markov process (X, Y) on V × V together with a
stopping time T for (X, Y) such that:

(C1) The projections X and Y have transition matrix P, identical to Z’s, on V.

(C2) Y is started in a state chosen according to the stationary distribution, $\pi_0^Y = \pi$. X is started
with the initial distribution of Z, $\pi_0^X = \pi_0$.

(C3) T is a stopping time for (X, Y) that also satisfies: if T = t then $X_k = Y_k$ for all $k \ge t$. The
coupling is proper if T is finite with probability 1. We will generally use ‘coupling’ to mean
‘proper coupling.’ If it occurs that T = t, the coupling is said to have succeeded at time t.

Note  that  in  the  process  (X,  Y),  the  projections  X  and  Y  need  not  be  independent,  and  in 
general  won’t  be.  The  requirement  that  the  pair-process  (X,  Y)  be  Markovian  can  be  removed,  as 
long  as  the  projections  remain  Markovian  like  Z .  Thorisson  [Tho86]  has  noted  that  the  requirement 
that  X  and  Y  meet  and  remain  identical  from  time  T  onwards  can  be  weakened  to  the  requirement 
that  X  and  Y  become  and  remain  identically  distributed  from  that  time  on,  without  substantially 
changing  the  proof  of  the  following  theorem.  Also,  one  can  always  adjust  the  coupling  to  ensure 
that  they  remain  together  once  they  meet. 

Theorem 1.28 (Coupling Inequality) Let Z be a Markov process on V with stationary
distribution $\pi$, and let (X, Y) be a coupling for Z with coupling time T. Let $\pi_k$ be the distribution of the
state of Z after k steps. Then for all $k \ge 0$, we have

\[ \|\pi_k - \pi\| \le \Pr\{T > k\}. \]

Proof: Let A be any subset of V. Let $z_k$, $x_k$ and $y_k$ denote the states of Z, X, and Y, respectively,
after the kth step. Then we have

\begin{align*}
|\pi_k(A) - \pi(A)| &= |\Pr\{z_k \in A\} - \pi(A)| \\
&= |\Pr\{x_k \in A\} - \Pr\{y_k \in A\}| \\
&= |\Pr\{x_k \in A \wedge k \ge T\} + \Pr\{x_k \in A \wedge T > k\} \\
&\qquad - (\Pr\{y_k \in A \wedge k \ge T\} + \Pr\{y_k \in A \wedge T > k\})| \\
&= |\Pr\{x_k \in A \wedge T > k\} - \Pr\{y_k \in A \wedge T > k\}| \\
&\le \max(\Pr\{x_k \in A \wedge T > k\},\, \Pr\{y_k \in A \wedge T > k\}) \\
&\le \Pr\{T > k\}.
\end{align*}

In the second line we have used the fact that the distributions of X and Z are identical, and the fact
that Y is started, and thereby remains, in the stationary distribution. In the fourth line we use the
fact that X and Y coincide from time T onwards. □

Couplings have their origins in proofs of ergodic theorems for more general stochastic processes,
rather than in convergence rate proofs for Markov chains. Showing the existence of a stationary
distribution for a process, and displaying a coupling with $\Pr\{T > k\}$ going to zero with increasing k,
gives an ergodic theorem. (See Griffeath [Gri78] for a survey of coupling methods for Markov
processes on general state spaces.) For our purposes, we want our processes to have an easily
analysable coupling time that closely matches the actual rate of convergence.




Griffeath [Gri75] shows that there always exists a maximal coupling, one in which

\[ \|\pi_k - \pi\| = \Pr\{T > k\} \]

for all k, thus achieving equality in the Coupling Inequality. Unfortunately, constructing such a
coupling usually requires more knowledge of the chain than we have. However, we can sometimes make
use of the fact that such couplings exist, without actually constructing one. (See, e.g., Theorem 2.8.)
Here are some simple examples of couplings. We give some new coupling bounds in Chapter 2.

Example 1.29 (Line) We consider again the random walk P on the n-segment. (See
Example 1.21.) Our strategy for (X, Y) is to run the chain X normally like P, and to mimic X with
Y. Let X start out at the left endpoint and Y at a uniformly chosen point. Choose a move for
X normally. If X moves right, move Y right (or at the right endpoint take the self-loop). If X
moves left, move Y left (or at the left endpoint take the self-loop). This clearly makes X and Y
each behave like P. Notice that when X and Y meet, they remain together. Let T be the time
they first meet. This clearly gives a coupling. To bound the tails on T, note that by following this
strategy for (X, Y), X can never move to the right of Y. In particular, by the first time Y
reaches the left boundary, X and Y must have met. It is a well-known fact that the random walk
on such a segment hits the left boundary in $O(n^2)$ steps. See [Fel70, Vol. I, Ch. XIV] for a detailed
analysis of this time. Note that this is the ‘right answer,’ as we show in Chapter 2. □
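
For a small segment, the Coupling Inequality can be checked exactly for the coupling of Example 1.29. The Python sketch below (an illustration under our own naming) propagates the joint law of (X, Y) with exact fractions, moving both coordinates in the same direction at every step, and records the variation distance $\|\pi_k - \pi\|$ of X from the uniform distribution together with the tail $\Pr\{T > k\} = \Pr\{X_k \ne Y_k\}$.

```python
from fractions import Fraction

def step(pos, move, n):
    """Apply a -1/+1 move on {1..n}; a blocked move at an endpoint is the self-loop."""
    return min(n, max(1, pos + move))

def coupling_check(n, steps):
    """Return a list of (k, variation distance, Pr{T > k}) for k = 0..steps."""
    half = Fraction(1, 2)
    # X starts at the left endpoint; Y starts uniformly on {1..n}.
    joint = {(1, y): Fraction(1, n) for y in range(1, n + 1)}
    rows = []
    for k in range(steps + 1):
        marginal_x = [Fraction(0)] * (n + 1)
        for (x, _), p in joint.items():
            marginal_x[x] += p
        tv = sum(abs(marginal_x[v] - Fraction(1, n)) for v in range(1, n + 1)) / 2
        # Once X and Y meet they move together, so T > k exactly when X_k != Y_k.
        tail = sum(p for (x, y), p in joint.items() if x != y)
        rows.append((k, tv, tail))
        nxt = {}
        for (x, y), p in joint.items():
            for move in (-1, 1):
                key = (step(x, move, n), step(y, move, n))
                nxt[key] = nxt.get(key, Fraction(0)) + p * half
        joint = nxt
    return rows
```

At k = 0 both quantities equal 1 - 1/n, and at every later step the variation distance stays at or below the tail probability, as Theorem 1.28 requires.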

Example 1.30 (Convergence of a Fixed Finite Chain) We can get a proof of the convergence
portion of the basic ergodic theorem for Markov chains as follows. Let P be an ergodic Markov
chain. We will show that if $\pi$ is any stationary distribution of P, then the chain started with any
distribution converges to $\pi$. This will also imply uniqueness of $\pi$. Recall that P being ergodic means that
there is a $k_0$ such that $P^k_{ij} > 0$ for all i, j and all $k \ge k_0$. Let p > 0 be the smallest entry in $P^{k_0}$. Now
consider the following coupling (X, Y). Choose an initial value for X according to any distribution,
and choose an initial value for Y according to $\pi$. Since $\pi$ is stationary, Y will remain so distributed.
Now run both X and Y, each independently according to P. Let T be the first time they meet. Now
after $k_0$ steps, the point X is in some state x, and the probability that Y is in the same state x is
at least p. It follows from the Coupling Inequality that $\|\pi_{k_0 t} - \pi\| \le \Pr\{T > k_0 t\} \le (1 - p)^t$. Thus
$\pi_k$ converges to $\pi$, and indeed “exponentially fast.” This example is somewhat misleading on
two accounts. Here the chain is fixed; p will not in general be constant in a family of chains for
which the vertex set grows. Furthermore, spectral analysis is usually required anyway to get any
reasonable handle on p (or anything like it). □

1.6.2  Strong  Stationary  Times 

A strong stationary time for a Markov chain $\{X_k\}$ with stationary distribution $\pi$ is a stopping
time T such that

\[ \Pr\{X_k \in A \mid T \le k\} = \pi(A). \]

When $\pi$ is the uniform distribution on the state space we also call T a strong uniform time. For
background and additional material see [AD85] and [DF88].

Essentially the same proof as that of the Coupling Inequality will verify that a strong stationary
time satisfies the same inequality

\[ \|\pi_k - \pi\| \le \Pr\{T > k\}. \]

Indeed strong stationary times can be considered a special type of coupling.

In the context of strong uniform times, it is somewhat more natural to measure distance between
distributions by the separation, defined by

\[ \mathrm{sep}(P, Q) = \max_{x \in S} \frac{Q(x) - P(x)}{Q(x)}. \]

(Separation is discussed further in Appendix A.) This is because strong stationary times satisfy the
stronger inequality

\[ \|\pi_k - \pi\| \le \mathrm{sep}(\pi_k, \pi) \le \Pr\{T > k\}. \]
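
As a small numerical illustration of the first inequality, the Python sketch below computes the variation distance and the separation for two distributions given as lists of exact fractions. The function names and the sample distribution are ours, and Q is assumed to have full support so that the ratio is defined.

```python
from fractions import Fraction

def variation_distance(p, q):
    """max_A |P(A) - Q(A)|, computed as half the L1 distance."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

def separation(p, q):
    """sep(P, Q) = max_x (Q(x) - P(x)) / Q(x), for Q with full support."""
    return max((b - a) / b for a, b in zip(p, q))

p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
q = [Fraction(1, 3)] * 3          # uniform target distribution
```

Here `variation_distance(p, q)` is 1/6 while `separation(p, q)` is 1/4, so the separation is strictly larger, which is the situation the next paragraph has in mind.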

Just as there always exists a maximal coupling, there is always a maximal strong stationary
time, one which achieves equality here. However, since we may have $\|\pi_k - \pi\| < \mathrm{sep}(\pi_k, \pi)$, strong
uniform times, in general, cannot be maximal couplings. (However, when the Markov chain is a
random walk on a group, one can show that if the variation gets small after k steps, the separation
is small after at most roughly 2k. So, roughly speaking, a maximal strong stationary (uniform) time
for a random walk on a group can only be off by about a factor of two. For a precise statement and
proof see [AD86, Proposition (5.13), p. 15].)


Example 1.31 (Top-in-at-Random Shuffle) Suppose a deck of n cards is shuffled by the following
procedure. Take the top card, and place it in a uniform random position in the deck (possibly
back on top). This defines a doubly stochastic Markov chain. How long does it take to converge to
uniformity? Let T be the first time the card, call it B, that was originally on the bottom comes to
the top and has just been re-inserted. T is a strong uniform time. This can be proved by induction
as follows. Note that the card B rises only by having a card inserted below it. The relative order of
the cards below B is uniform and each time a new card is inserted, this property is preserved. Thus,
when finally B reaches the top and is re-inserted, all of the cards are in random relative order. To
analyse T, note that $T = \sum_{1 \le i \le n} T_i$, where $T_i$ is a geometric random variable with parameter i/n.
This tells us that $E[T] = nH_n = n(\ln n + \gamma + o(1))$. The tail after this point goes down exponentially
and this can be seen easily using a coupon collecting argument. (See Chapter 2.) □
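
The expectation in Example 1.31 can be computed exactly, and the stopping time itself is easy to simulate. In the Python sketch below (illustrative names; index 0 of the list is taken to be the top of the deck, which is our own convention), `expected_T` evaluates $\sum_{i=1}^{n} n/i = nH_n$ and `simulate_T` runs the shuffle until the original bottom card has been taken from the top and re-inserted.

```python
import random
from fractions import Fraction

def expected_T(n):
    """E[T] = sum over i of n/i, the mean of a geometric variable with parameter i/n."""
    return sum(Fraction(n, i) for i in range(1, n + 1))

def simulate_T(n, rng):
    """Number of top-in-at-random steps until the original bottom card is re-inserted."""
    deck = list(range(n))         # index 0 is the top; card n-1 starts on the bottom
    bottom = n - 1
    steps = 0
    while True:
        card = deck.pop(0)
        # n equally likely slots among the remaining n-1 cards (slot 0 = back on top)
        deck.insert(rng.randrange(n), card)
        steps += 1
        if card == bottom:
            return steps
```

A run such as `simulate_T(52, random.Random(1))` gives one sample of T; averaging many samples should approach `expected_T(52)`. Every sample is at least n, since the bottom card must rise n - 1 places before it can be re-inserted.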


Chapter  2 


Urn  Models  and  Group  Actions 


Urn  transfer  models  are  processes  in  which  the  states  are  configurations  of  balls  in  urns,  and  whose 
evolution  proceeds  according  to  Markovian  transfer  rules  for  the  balls  amongst  the  urns.  Such 
processes  often  arise  as  models  of  diffusion  and  as  particle  models  in  statistical  mechanics.  Various 
applications are described in [JK77].

While it is easily seen that any finite-state Markov chain can be viewed in a trivial way as a
single-ball  urn  model,  one  gets  a  more  natural  class  of  models  by  insisting  that  the  transfer  rules 
have  certain  symmetries.  In  particular,  we  will  require  that  they  correspond  to  the  action  of  the 
generators  of  some  group. 

In  this  chapter,  we  give  some  upper  bounds  on  the  time  to  stationarity  of  such  processes.  We  first 
present  methods  of  dealing  with  two  well-known  urn  models.  We  then  give  general  diameter-based 
bounds  on  the  Laplacian  eigenvalues  and  rate  of  convergence  for  certain  Markov  chains  based  on 
group  actions. 


2.1  Ehrenfest-type  Models 

2.1.1  Arbitrary  Steps 

Consider  a  system  of  n  labelled  balls  distributed  amongst  m  labelled  urns,  where  the  configuration 
of  the  system  evolves  in  discrete  time  according  to  the  following  rule.  At  each  time  step,  one  of  the 
n  balls  is  chosen  uniformly  at  random,  and  this  ball  ‘jumps’  from  the  urn  it  is  in  to  a  new  location 
chosen  uniformly  at  random  amongst  the  urns  (including  the  one  in  which  the  ball  started).  We 
call  this  chain  the  Ehrenfest-(n,m)  process  after  the  physicists  who  proposed  the  m  =  2  case  to 
model  certain  bi-valued  properties  of  particles  [EE07,  Fel70]. 

This process is a symmetric ergodic Markov chain. By associating to each ball the number of
the urn it is in, we obtain a correspondence of the states of the system to ordered n-tuples where
each coordinate is drawn from the set [m]. Using the standard Cartesian product notation for sets,
we denote this $V = [m]^n$. The stationary distribution is uniform on the state space.

Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be graphs (in our general sense) with $|V_1| = n_1$ and
$|V_2| = n_2$. Let $A_1$ and $A_2$ be their adjacency matrices. The Cartesian product, $G = G_1 \times G_2$, is
the graph whose vertex set is $V_1 \times V_2$ and whose adjacency matrix is given by

\[ A_1 \otimes I_{n_2} + I_{n_1} \otimes A_2, \]

where $I_n$ denotes the identity matrix of order n and $\otimes$ denotes the Kronecker product. (So $A_1 \otimes I_{n_2}$
is the $n_1 n_2 \times n_1 n_2$ matrix obtained by placing $n_2$ ‘copies’ of $A_1$ along the diagonal, and $I_{n_1} \otimes A_2$ is
obtained from $A_2$ by replacing $a_{ij}$ by a diagonal block of $n_1$ copies of $a_{ij}$.) Informally, the product
graph may be constructed by first making a copy $G_v$ of $G_1$ for each vertex $v \in V_2$, and joining
each corresponding pair of vertices in $G_v$ and $G_w$ if (v, w) is an edge in $G_2$. Multiple edges and
self-loops are copied with their corresponding multiplicities. The Cartesian product is an associative
operation on graphs.

The Ehrenfest-(n,m) process may be viewed as the natural random walk on the nm-regular graph
obtained by adding n self-loops to each vertex of $K_m^n$, the Cartesian product of n cliques of size m.
When m = 2, this is the n-dimensional hypercube with n self-loops at each vertex. Diaconis [Dia88]
gives arguments to deal with this case. In this section we generalise to larger m using a different
method.

Lemma 2.1 (Laplacian Spectrum of Cartesian Products) Let $H = (V_H, E_H)$ and $K = (V_K, E_K)$
be two graphs. Let $G = H \times K$ be their Cartesian product. Then the eigenvalues of
Q(G) are the elements of the multiset of all possible sums $\nu_H + \nu_K$, where $\nu_H$ is an eigenvalue of
Q(H) and $\nu_K$ is an eigenvalue of Q(K). In particular $\nu_1(G) = \min\{\nu_1(H), \nu_1(K)\}$.

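Proof: The statement follows easily from the structure of the matrix Q = D - A for the product
graph G in terms of the corresponding matrices for the components, which we write as $G_1 = H$ and
$G_2 = K$. Again, let $n_1 = |V_1|$, $n_2 = |V_2|$, and let $I_n$ denote the identity matrix of order n. One can
verify that

\[ Q(G) = Q(G_1) \otimes I_{n_2} + I_{n_1} \otimes Q(G_2). \]

The result then follows from the well-known theorem of Kronecker that for square matrices A and
B, of order a and b respectively, the eigenvalues of the product $A \otimes B$ are the ab possible products
of eigenvalues, one from A and one from B. (See e.g. [MM64, p. 24, Prop. 2.15.11].) Indeed, if
$Q(G_1)u = \nu u$ and $Q(G_2)v = \nu' v$, then the Kronecker product of u and v is an eigenvector of both
summands, with eigenvalues $\nu$ and $\nu'$, and hence an eigenvector of Q(G) with eigenvalue $\nu + \nu'$. □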
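
The lemma can be checked on a small case without any numerical library. The Python sketch below takes the Cartesian product of $K_2$ and $K_3$ and verifies, in integer arithmetic, that the Kronecker product of an eigenvector of each factor is an eigenvector of the product Laplacian with the summed eigenvalue. It uses the common convention $A \otimes B = [a_{ij} B]$, which may order the vertices differently from the text but yields an isomorphic graph; all names and the choice of factors are ours.

```python
def kron(A, B):
    """Kronecker product [a_ij * B] of square matrices given as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def add(A, B):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(A, B)]

def laplacian(adj):
    """Q = D - A for an adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A1 = [[0, 1], [1, 0]]                        # K_2
A2 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]       # K_3
pairs1 = [(0, [1, 1]), (2, [1, -1])]                          # eigenpairs of Q(K_2)
pairs2 = [(0, [1, 1, 1]), (3, [1, -1, 0]), (3, [1, 1, -2])]   # eigenpairs of Q(K_3)

A = add(kron(A1, identity(3)), kron(identity(2), A2))
Q = add(kron(laplacian(A1), identity(3)), kron(identity(2), laplacian(A2)))

def verified_eigenvalues():
    """Check Q (u (x) v) = (nu1 + nu2) (u (x) v) for every pair; return the sums."""
    found = []
    for nu1, u in pairs1:
        for nu2, v in pairs2:
            w = [a * b for a in u for b in v]
            assert matvec(Q, w) == [(nu1 + nu2) * t for t in w]
            found.append(nu1 + nu2)
    return sorted(found)
```

The six product vectors are linearly independent, so the sums returned by `verified_eigenvalues()` make up the whole spectrum of the product Laplacian, and `laplacian(A) == Q` confirms that the Kronecker-sum formula for Q agrees with the one for the adjacency matrix.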

Theorem 2.2 (Eigenvalues of the Ehrenfest Process) Let P be the Ehrenfest-(n,m) process.
Let L = I - P be the associated Laplacian. The eigenvalues of L are the values

\[ \frac{j}{n}, \quad 0 \le j \le n, \quad\text{with multiplicity } \binom{n}{j}(m-1)^j. \]




Proof: First consider $K_m$, the m-clique. For this graph we have $Q(K_m) = mI - J$, where $J$ is
the $m \times m$ matrix with unit entries throughout. The characteristic polynomial is $\det(Q - zI) =
-z(m - z)^{m-1}$. Thus its eigenvalues are 0 and m, where m occurs with multiplicity $m - 1$. It follows
from the preceding lemma that the spectrum of $Q(K_m^n)$ is the set of values $mj$, $0 \le j \le n$, where
the value $mj$ occurs with multiplicity $\binom{n}{j}(m-1)^j$, and that the eigenvalues of $L(P) = \frac{1}{nm}Q(G)$ are
$\frac{mj}{nm} = \frac{j}{n}$ with the same multiplicities. □
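
The spectrum in Theorem 2.2 can be sanity-checked with exact arithmetic. The sketch below lists the eigenvalues with their multiplicities and checks two necessary conditions: the multiplicities sum to the number of states $m^n$, and the eigenvalues sum to the trace of L. The trace value assumes that each state holds with probability 1/m (n self-loops out of degree nm); the function names are illustrative only.

```python
from fractions import Fraction
from math import comb


def ehrenfest_laplacian_spectrum(n, m):
    """Return [(eigenvalue, multiplicity)] for L = I - P:
    eigenvalue j/n with multiplicity C(n, j) * (m - 1)**j, j = 0..n."""
    return [(Fraction(j, n), comb(n, j) * (m - 1) ** j) for j in range(n + 1)]


def check_spectrum(n, m):
    """Check two necessary conditions on the claimed spectrum."""
    spec = ehrenfest_laplacian_spectrum(n, m)
    total = sum(mult for _, mult in spec)
    trace = sum(val * mult for val, mult in spec)
    # Assumption: every state holds with probability 1/m, so each diagonal
    # entry of L is 1 - 1/m and trace(L) = m**n * (1 - 1/m).
    return total == m ** n and trace == m ** n * (1 - Fraction(1, m))
```

These checks do not prove the theorem; they only confirm that the stated multiplicities are consistent with the size and trace of the matrix.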


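Corollary 2.3 (Full-spectrum bound for the Ehrenfest process) For any initial distribution,
let $\pi_k$ be the kth-step distribution of the Ehrenfest-(n, m) process. If $k \ge \frac{n}{2}(\ln n + \ln m + c)$ for any
$c > 0$, we have

\[ \|\pi_k - U\|^2 \le \frac{1}{4}\left(e^{e^{-c}} - 1\right). \]

Proof: Apply the previous result and Theorem 1.9. Taking $k \ge \frac{n}{2}(\ln n + \ln m + c)$, one gets

\begin{align*}
\|\pi_k - U\|^2 &\le \frac{1}{4} \sum_{j=1}^{n} \binom{n}{j} (m-1)^j \left(1 - \frac{j}{n}\right)^{2k} \\
  &\le \frac{1}{4} \sum_{j=1}^{n} \frac{(nm)^j}{j!} e^{-2kj/n} \\
  &\le \frac{1}{4} \sum_{j=1}^{n} \frac{e^{-cj}}{j!} \\
  &\le \frac{1}{4}\left(e^{e^{-c}} - 1\right).
\end{align*}

Here, in moving to the second line we have used the fact that $(1 - x) \le e^{-x}$, and in the last line, we
use the fact that the sum is an initial segment of the series expansion for $e^{e^{-c}} - 1$. □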
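
To see how much slack the chain of inequalities leaves, one can evaluate the full-spectrum sum directly and compare it with the closed form $\frac{1}{4}(e^{e^{-c}} - 1)$. This is a numerical sketch only; the helper names are hypothetical, and the number of steps is taken as the smallest integer satisfying $k \ge \frac{n}{2}(\ln n + \ln m + c)$.

```python
from math import ceil, comb, exp, log


def full_spectrum_bound(n, m, k):
    """(1/4) * sum_{j=1..n} C(n, j) * (m-1)^j * (1 - j/n)^(2k)."""
    return 0.25 * sum(
        comb(n, j) * (m - 1) ** j * (1 - j / n) ** (2 * k)
        for j in range(1, n + 1)
    )


def closed_form(c):
    """(1/4) * (exp(exp(-c)) - 1), the final bound in the proof."""
    return 0.25 * (exp(exp(-c)) - 1)


def steps_needed(n, m, c):
    """Smallest integer k with k >= (n/2) * (ln n + ln m + c)."""
    return ceil(0.5 * n * (log(n) + log(m) + c))
```

For instance, `full_spectrum_bound(10, 2, steps_needed(10, 2, 1.0))` can be compared against `closed_form(1.0)`.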


For  m  =  2  the  bound  is  essentially  tight  [Dia88].  For  n  fixed  and  m  growing  the  bounds 
obtained by the preceding technique grow with m. In actuality, however, the time to stationarity
is  independent  of  m,  as  is  shown  in  the  next  theorem,  by  an  easy  strong  uniform  time  bound. 
Knowledge  of  the  spectrum,  however,  is  useful  in  applications,  such  as  in  Chapter  3. 

Theorem 2.4 (Uniform Time for Ehrenfest-(n,m)) Consider the Ehrenfest-(n, m) process,
and let T be the first time each ball has been chosen at least once to move. Then T is a strong
uniform time.


Proof: Mark the coordinate corresponding to a ball immediately after it is first chosen to move.
The value in the first marked coordinate, upon being marked, is uniformly distributed over its m
possibilities by the definition of the process. Similarly, as a simple induction shows, the restriction
of the state to the marked coordinates is always uniformly distributed, and independent of the time
the set was so marked. It follows that the first time all coordinates are marked is a strong uniform
time for the process. □

The analysis of the above strong uniform time is the same as that of the “coupon collector’s
problem.” An easy bound can be obtained as follows.

Corollary 2.5 Let $\pi_k$ be the kth-step distribution of the Ehrenfest-(n,m) process. For any initial
distribution we have

\[ \|\pi_k - U\| \le e^{-c} \quad \text{whenever } k \ge n \ln n + cn. \]

Proof: Let $E_{i,k}$ denote the event that after k steps, the coordinate i remains unmarked. The event
$\{T > k\}$ is clearly $\bigcup_{1 \le i \le n} E_{i,k}$. So

\[ \Pr\{T > k\} \le \sum_{1 \le i \le n} \Pr\{E_{i,k}\} = n \Pr\{E_{1,k}\}, \]

since the probability of a union of events is bounded by the sum of their probabilities, and each of
the $E_{i,k}$ is equiprobable. Now, the probability that a given coordinate, say coordinate 1, is unmarked
after step k is the probability it has not been chosen by step k, which is $(1 - \frac{1}{n})^k$. Whence

\[ \Pr\{T > k\} \le n\left(1 - \frac{1}{n}\right)^k = n e^{k \ln(1 - 1/n)} \le n e^{-k/n}, \]

from which the inequality follows immediately. □
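
The union bound in this proof can be compared with the exact tail of T, which is available by inclusion-exclusion over the set of coordinates that remain unmarked. The following is a sketch with hypothetical function names, using exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def prob_T_exceeds(n, k):
    """Exact Pr{T > k}: at least one of the n coordinates is still
    unmarked after k steps (inclusion-exclusion over unmarked sets)."""
    return sum(
        (-1) ** (i + 1) * comb(n, i) * Fraction(n - i, n) ** k
        for i in range(1, n + 1)
    )


def union_bound(n, k):
    """The bound n * (1 - 1/n)^k used in the proof."""
    return n * Fraction(n - 1, n) ** k
```

With n = 2 and k = 2, for example, T exceeds 2 exactly when the same ball is chosen twice, and `prob_T_exceeds(2, 2)` evaluates to 1/2 while `union_bound(2, 2)` is also 1/2.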

Remark: As n grows, the coupling time T is asymptotically Poisson with mean n. For n growing
we have

\[ \Pr\{T \le n \ln n + cn\} = e^{-e^{-c}} + o(1). \]

□

Remark:  The  number  of  self  loops  affects  the  holding  probabilities,  and  therefore  the  convergence 
rate  bounds,  but  not  significantly.  The  spectral  argument  may  be  used  with  minimal  change  to 
bound  the  walk  on  the  graph  for  any  number  of  self-loops,  provided  the  graph  remains  regular.  For 
$m = 2$ some self loops are necessary to avoid periodicity. For $m \ge 3$, the walk on the graph is
aperiodic  without  the  need  to  add  self-loops.  The  strong  uniform  time  argument  is  more  sensitive 
to  the  number  of  self-loop  edges.  There  is  a  simple  coupling  with  the  same  convergence  rate  for  the 
case  of  a  single  self-loop  at  each  vertex. 

Broder  [Bro],  and  Aldous  and  Diaconis  [AD86]  have  suggested  similar  strong  uniform  times  for 
the  special  case  of  m  =  2,  when  the  holding  probability  at  each  state  is  1/2.  The  extension  here  to 
larger  m  is  straightforward,  if  that  is  known.  This  result  was  obtained  independently  around  the 
same time (1986). □




2.1.2  Adjacent  Moves 

An  interesting  variant  of  the  Ehrenfest  model  is  the  adjacent-move  model.  Here,  as  before,  we 
have  n  labelled  balls  and  m  labelled  urns.  A  ball  is  chosen  uniformly;  say  it  is  currently  in  urn  i.  A 
value $d \in \{-1, +1\}$ is chosen uniformly, and the ball moves to urn $i + d$ if that urn exists (i.e.,
$1 \le i + d \le m$). Otherwise it remains in the same position.

This leads to the natural random walk on a 2n-regular graph given by the Cartesian product
of n m-segments, where each m-segment has a single self-loop at its two ends. (See our earlier
Examples 1.21 and 1.29.) The stationary distribution is uniform on $V = [m]^n$.
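
One step of the adjacent-move model, as just described, can be sketched as follows. The representation (a tuple whose entry for ball b is the urn, numbered 1 to m, that holds it) and the function name are assumptions made for illustration.

```python
import random


def adjacent_move_step(state, m, rng=random):
    """One step of the adjacent-move model.

    state[b] is the urn (1..m) currently holding ball b. A new tuple is
    returned; the input is not modified.
    """
    b = rng.randrange(len(state))   # choose a ball uniformly
    d = rng.choice((-1, 1))         # choose a direction uniformly
    target = state[b] + d
    if not 1 <= target <= m:        # no such urn: the ball stays put
        return state
    return state[:b] + (target,) + state[b + 1:]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.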

We first consider the walk on the m-segment. The eigenvalues of this transition matrix are known
classically. They are $\cos(\pi j/m)$, for $0 \le j \le m - 1$. (See [Fel70, Vol. 1, Ch. XVI.3].) The unit
eigenvalue comes from $j = 0$.
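
These eigenvalues can be checked numerically against explicit eigenvectors. The sketch below assumes the walk moves left or right with probability 1/2 each, with the self-loop at each end absorbing the move that would leave the segment, and uses the candidate eigenvectors $v_j(x) = \cos(\pi j (x - \tfrac{1}{2})/m)$ for $x = 1, \ldots, m$; the eigenvector formula is a standard choice supplied here, not taken from the text.

```python
from math import cos, pi


def segment_walk_apply(v):
    """Apply the transition matrix of the m-segment walk to a vector v.

    Interior vertices move left or right with probability 1/2; at either
    end the move off the segment is replaced by a self-loop.
    """
    m = len(v)
    out = []
    for x in range(m):
        left = v[x - 1] if x > 0 else v[x]        # self-loop at the left end
        right = v[x + 1] if x < m - 1 else v[x]   # self-loop at the right end
        out.append(0.5 * (left + right))
    return out


def eigenvector(m, j):
    """Candidate eigenvector v_j(x) = cos(pi * j * (x - 1/2) / m), x = 1..m."""
    return [cos(pi * j * (x - 0.5) / m) for x in range(1, m + 1)]
```

Applying `segment_walk_apply` to `eigenvector(m, j)` should reproduce the vector scaled by `cos(pi * j / m)`, up to rounding.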

Theorem 2.6 (Full Spectrum Bound for the Line) Let $\pi_k$ be the kth-step distribution of the
natural random walk on the m-segment. For any initial distribution and any real $c \ge 1$, if $k \ge \frac{cm^2}{\pi^2}$ then

\[ \|\pi_k - U\|^2 \le e^{-c}. \]


Proof: Theorem 1.9 gives us

\[ \|\pi_k - U\|^2 \le \frac{1}{4} \sum_{j=1}^{m-1} \left(\cos(\pi j/m)\right)^{2k}. \]


It  is  not  too  difficult  to  bound  this  sum  using  properties  of  the  cosine. 
The  following  properties  are  handy.  We  have 


\[ \cos(x) \le e^{-x^2/2} \quad \text{for } 0 \le x \le \pi/2. \tag{2.1} \]

To see this, let $h(x) = \ln(\cos(x)e^{x^2/2})$ and note $h(0) = \ln(1) = 0$, $h'(x) = x - \tan x \le 0$ for
$0 \le x < \pi/2$. So $h(x) \le h(0)$ in this interval, and (2.1) follows. We also use the fact that for
$1 \le j < m$ we have $\cos(\pi j/m) = -\cos(\pi i/m)$ where $i = m - j$, so their even powers are equal.
Finally note that $\cos(\pi/2) = 0$.

Using  these  properties,  we  have 

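\begin{align*}
\frac{1}{4} \sum_{j=1}^{m-1} (\cos(\pi j/m))^{2k}
  &= \frac{1}{2} \sum_{j=1}^{\lfloor (m-1)/2 \rfloor} (\cos(\pi j/m))^{2k} \\
  &\le \frac{1}{2} \sum_{j=1}^{\lfloor (m-1)/2 \rfloor} e^{-k\pi^2 j^2/m^2} \\
  &\le \frac{1}{2} \sum_{j=1}^{\infty} e^{-k\pi^2 j^2/m^2} \\
  &= \frac{e^{-k\pi^2/m^2}}{2} \sum_{j=0}^{\infty} e^{-k\pi^2 (j^2 + 2j)/m^2} \\
  &\le \frac{e^{-k\pi^2/m^2}}{2} \sum_{j=0}^{\infty} e^{-3k\pi^2 j/m^2} \\
  &= \frac{e^{-k\pi^2/m^2}}{2\left(1 - e^{-3k\pi^2/m^2}\right)}.
\end{align*}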

The denominator is certainly greater than 1 when $k \ge m^2/\pi^2$. The result follows. □
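
The two ingredients of this proof can be checked numerically: the cosine inequality $\cos(x) \le e^{-x^2/2}$ on $[0, \pi/2]$, and the comparison of the exact eigenvalue sum with a geometric-series bound. In this sketch the closed form $e^{-a}/(2(1 - e^{-3a}))$ with $a = k\pi^2/m^2$ is an assumption about the form of the final bound (it follows from $j^2 + 2j \ge 3j$), and the function names are illustrative.

```python
from math import cos, exp, pi


def exact_sum(m, k):
    """(1/4) * sum_{j=1..m-1} cos(pi*j/m)^(2k)."""
    return 0.25 * sum(cos(pi * j / m) ** (2 * k) for j in range(1, m))


def geometric_bound(m, k):
    """exp(-a) / (2 * (1 - exp(-3a))) with a = k * pi^2 / m^2."""
    a = k * pi ** 2 / m ** 2
    return exp(-a) / (2 * (1 - exp(-3 * a)))


def cosine_inequality_holds(points=200):
    """Check cos(x) <= exp(-x^2 / 2) on a grid over [0, pi/2]."""
    xs = [i * (pi / 2) / points for i in range(points + 1)]
    return all(cos(x) <= exp(-x * x / 2) + 1e-15 for x in xs)
```

Evaluating both functions at $k = \lceil c m^2/\pi^2 \rceil$ for a few values of m gives a feel for how loose the bound is.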


This  bound  is  tight  to  within  a  constant  factor. 

Theorem 2.7 (Lower Bound for the Line) Let $\pi_k$ be the kth-step distribution of the natural
random walk on the m-segment, when $\pi_0$ is any point mass distribution on one vertex. For any
$c > 0$, if $k \le \frac{m^2}{16(c+1)}$ then

\[ \|\pi_k - U\| \ge 1 - 2e^{-c}. \]

Proof: If $k \le m^2/16(c+1)$ moves are taken, a standard Chernoff bound (e.g., Lemma 3.10)
ensures that the probability that we are at distance at least $m/4$ away from the initial point is at
most $2e^{-m^2/16k} \le 2e^{-(c+1)} \le e^{-c}$. Rephrasing this, the total mass of $\pi_k$ on that proportion of points
at distance more than $m/4$ is at most $e^{-c}$. Let $p \le 1/2$ be the fraction of points at distance less
than $m/4$ from the initial vertex. The variation distance is at least $((1 - p) - e^{-c}) + ((1 - e^{-c}) - p) =
2(1 - p) - 2e^{-c} \ge 1 - 2e^{-c}$. □

Remark: Diaconis [Dia88] gives similar tight bounds for random walk on the circle $Z_m$, m odd.
He obtains the eigenvalues for that transition matrix by harmonic analysis. They are $\cos(2\pi j/m)$
for $1 \le j \le m - 1$. Though they are different, they give rise to the same sum as in our upper bound,
and from that point we use the same argument. Our lower bound argument is different from his,
but a similar technique is also suggested by Diaconis, p. 27. □

Lemma  2.1  (about  Cartesian  products)  tells  us  the  whole  spectrum  for  the  Ehrenfest  process 
as  sums  of  cosines,  but  the  resulting  sum  in  the  full  spectrum  bound  is  unwieldy.  Lemma  2.1  also 
gives a bound on $\mu_1(P)$ alone which is easy to apply. But this gives an $O(n^3m^2 \ln m)$ bound on the
number  of  steps  sufficient  to  get  close  to  stationarity.  We  can  get  a  better  result  from  a  coupling 
argument.  The  intuition  is  that  the  projection  to  any  one  coordinate  is  a  walk  on  an  m-segment. 
Moreover,  the  uniform  distribution  in  each  component  induces  the  uniform  distribution  on  the  whole 
product. It will suffice to take enough steps to ensure that each component gets some $\Omega(m^2)$ steps.
The  following  theorem  makes  this  rigorous. 

Theorem 2.8 (Adjacent Move Ehrenfest Process) Let P be the adjacent-move Ehrenfest-(n, m)
process, with $m \ge 4$. For any real $c \ge 1$, let $d = \pi^{-2}cm^2 \ln 2n$. Then for any initial distribution
and any $k \ge (6d - 1)n$ we have

\[ \|\pi_k - U\| \le e^{-c}. \]

For fixed c, this gives an $O(m^2 n \ln n)$ convergence guarantee.




Proof: It suffices to prove the statement for any given initial position $x \in V = [m]^n$.

We know (see Section 1.6.1) that there exists a maximal coupling for any Markov process, i.e.
a coupling that achieves equality in the coupling inequality. Write $x = (x_1, x_2, \ldots, x_n)$ and for
$1 \le i \le n$, let $C_i = (A_i, B_i)$ be a maximal coupling for the natural random walk on the m-segment,
where $A_i$ starts in $x_i$, and $B_i$ starts at a point $y_i$ chosen according to the uniform distribution on
the m-segment. Let $T_i$ be the associated stopping time random variable for the coupling $C_i$.

Now construct a ‘full coupling’ $C = (X, Y)$ for the chain P as follows. Start X at the initial
vertex $x = (x_1, x_2, \ldots, x_n) \in V$, and start Y at $y = (y_1, y_2, \ldots, y_n)$ which, by the way the $y_i$ were
chosen, is uniformly distributed on V. At each time step choose a coordinate i in [n]. Make a move
in the coupling process $C_i = (A_i, B_i)$, and duplicate the move of $A_i$ in coordinate i of X, and the
move of $B_i$ in coordinate i of Y. Since $C_i$ is a coupling for the natural random walk on the line,
both $A_i$ and $B_i$ move according to the transition probabilities for natural random walk on the line.
It is easy to see from this that X and Y move according to the transition probabilities given by P.
Notice that from the point that $C_i$ has successfully coupled coordinate i, the processes X and Y
will also agree thereafter in coordinate i, since duplicating the moves of $C_i$ must keep them coupled.
Thus $C = (X, Y)$ is a coupling for P with coupling time $T = \max_i T_i$, and it remains only to bound
the tail of the distribution of T.

To do this, first let $d = \pi^{-2}cm^2 \ln(2n)$, and suppose that at some time k we have taken at least
d moves in coordinate i, for some particular i. The probability that X and Y do not agree by this
time in their ith coordinate is precisely the probability that $C_i$ has not coupled, in other words
$\Pr\{T_i > d\}$. But because $C_i$ is a maximal coupling, and by our earlier bounds for the walk on the
m-segment, $\Pr\{T_i > d\}$ is at most $\exp(-c \ln(2n)) = e^{-c}/2n$. Thus, if at some time k we have made
at least d moves in every coordinate, the probability that at least one of the $C_i$ has not yet coupled
is at most $e^{-c}/2$.

Now we will show that if we make $k \ge (6d - 1)n$ moves in the coupling C, we will have
moved at least d times in every coordinate with high probability, and thus we will have coupled with
high probability.

Suppose we make k moves in the full coupling C. Consider any one particular coordinate i. Of
these k moves, the number $k_i$ of moves that take place in coordinate i (and hence coupling $C_i$) has
the Binomial(k, p) distribution, where $p = 1/n$. So, using a standard Chernoff bound (Lemma 3.10),
we have

\[ \Pr\{k_i < d\} = \Pr\{k_i \le d - 1\} \le \exp\left(-(kp - (d-1))^2/4kpq\right), \]

where $q = 1 - p$. Noting that $pq \le p = 1/n$, and then substituting the values of k, p, and d we get

\begin{align*}
\Pr\{k_i < d\} &\le \exp\left(-(kp - (d-1))^2/4kpq\right) \\
  &\le \exp\left(-(kp - (d-1))^2 n/4k\right) \\
  &= \exp\left(-(5d)^2 n/4k\right) \\
  &= \exp\left(-25d^2 n/4k\right) \\
  &= \exp\left(-25d^2/(24d - 4)\right) \\
  &\le \exp(-d) \\
  &= \exp\left(-\pi^{-2}cm^2 \ln(2n)\right) \\
  &\le \exp(-c \ln(2n)), \quad \text{since } m \ge 4 > \pi \\
  &\le e^{-c}/2n.
\end{align*}

So the union of events $\{k_i < d\}$, which is to say the event that we fail to make at least d moves
in some coordinate, has probability at most $e^{-c}/2$.
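
The middle of this chain, $\Pr\{k_i < d\} \le \exp(-d)$ when $k = (6d - 1)n$, can be compared with the exact binomial tail. The sketch below assumes d is a positive integer (in the proof d is real-valued) and uses a hypothetical function name.

```python
from math import comb, exp


def prob_fewer_than_d(n, d, k):
    """Exact Pr{k_i < d} when k_i ~ Binomial(k, 1/n)."""
    p = 1 / n
    return sum(comb(k, t) * p ** t * (1 - p) ** (k - t) for t in range(d))


def chernoff_target(d):
    """The value exp(-d) that the chain of inequalities reaches."""
    return exp(-d)
```

For example, with n = 5 and d = 4 one takes k = (6 * 4 - 1) * 5 = 115 and compares `prob_fewer_than_d(5, 4, 115)` with `chernoff_target(4)`.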

Thus after $k = (6d - 1)n$ moves, the probability that we have not coupled in the full coupling
C is at most $e^{-c}$, because we can only have failed to couple in one of two ways: (1) we made at
least d moves in every coordinate but still failed to couple in at least one coordinate, which we’ve
shown happens with probability at most $e^{-c}/2$, or (2) we failed to make at least d moves in some
coordinate and we also failed to couple. This happens with probability at most $e^{-c}/2$. The theorem
then follows from the Coupling Inequality. □

Remark: The idea behind the coupling above can be used more generally to construct coupling
bounds in Cartesian product chains from bounds on each component. The latter part of the argument
is essentially an upper bound on the multiple coupon collecting problem: Suppose that you are
given a sequence of coupons drawn from an infinite supply of n different types of coupons, and
where the type of each successive coupon is independently and uniformly distributed over the n
types. How many coupons must you draw in order to get at least d coupons of each type
(with high probability)? In Corollary 2.5, we gave an answer to this question when $d = 1$. Several
authors [NS60, ER61, Fla82] have given asymptotic answers to this and related questions for the
case where d is any fixed positive integer, and n approaches infinity. To illustrate, let the random
variable $K_{d,n}$ denote the number of coupons required to get d complete sets of the n different types
of coupons. Erdős and Rényi [ER61] show that for every fixed positive integer d, and every real x
one has

\[ \lim_{n \to \infty} \Pr\{K_{d,n} < n \ln n + (d - 1) n \ln \ln n + xn\} = \exp\left(-\frac{e^{-x}}{(d-1)!}\right). \]

Their proof begins to need patching near the outset if we wish to handle even $d = \Theta(\log n)$. For our
problem, we in fact have $d = c\pi^{-2}m^2 \log n$, and we want to give guarantees that hold for finite n,
and not only in the limit. For the specific case we consider, our argument gives

\[ \Pr\{K_{d,n} > (6d - 1)n\} \le e^{-c}/2. \]

For any fixed $c > 0$, this bound is clearly optimal to within the factor 6, since for any d and n it is
necessary to draw at least dn coupons to have any nonzero probability of getting d complete sets.




However, our bound may not be optimal as a function of c. Further non-asymptotic analysis of the
general  multiple  coupon  collector’s  problem  would  be  interesting  and  useful.  □ 
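
For small n and d, the random variable $K_{d,n}$ is easy to sample, which gives an empirical feel for the finite-n behaviour discussed in the remark. This is a simulation sketch only; the function name and the choice to pass in a seeded generator are assumptions, and d is taken to be a positive integer.

```python
import random


def draws_for_d_sets(n, d, rng):
    """Draw uniform coupon types until each of the n types has been seen
    at least d times; return the number of draws (a sample of K_{d,n})."""
    counts = [0] * n
    short = n                # number of types seen fewer than d times
    draws = 0
    while short > 0:
        t = rng.randrange(n)
        counts[t] += 1
        draws += 1
        if counts[t] == d:   # this type has just reached d copies
            short -= 1
    return draws
```

A histogram of repeated samples, for instance `[draws_for_d_sets(6, 3, rng) for _ in range(1000)]`, can then be compared with the threshold (6d - 1)n.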


2.2  Bernoulli-Laplace-type  Models 

Let n be a positive integer, $\beta = (\beta_1, \beta_2, \ldots, \beta_m)$ a partition of n into m nonzero parts. We assume
that $n > m \ge 2$, and $\beta_1 \ge \beta_2 \ge \cdots \ge \beta_m > 0$. Consider the following discrete-time Markov process.
Initially, n labelled balls are distributed into m labelled urns with balls $1 \ldots \beta_1$ in urn 1, the next
$\beta_2$ balls in urn 2, and so forth. Then at each time step, two urns are chosen uniformly at random.
From each of the two chosen urns a ball is chosen uniformly at random, and the chosen balls are
swapped; each moves to the other’s urn.

We call this the Bernoulli-Laplace model with parameters n, m, and $\beta$. Daniel Bernoulli
proposed it as a simple model of diffusion, and Laplace did some analysis. Similar models appear
in other contexts within statistical mechanics [Fel70, Vol. I, pp. 39ff, 188ff].

The state space of the process is naturally identified with the set partitions of [n] into m subsets
with $\beta_i$ elements in the ith subset. The state may also be associated with a Young tableau of fixed
“shape” $\beta$ in which the contents $1, 2, \ldots, n$ are decreasing in each row, but obey no order condition
in the columns.

The process is a symmetric ergodic Markov chain. The distribution of the state therefore converges
to the uniform distribution on the state space. How many steps are required in order that
the true distribution of the process be close to the uniform stationary distribution? This question
has been answered with tightest possible bounds by Diaconis and Shahshahani in the classical case
when $m = 2$ [DS87]. (In an earlier result they also handled the process that results if we allow
$n = m$ [DS81], which gives a random walk on the set of permutations of [n] generated by applying
random transpositions. Here we deal only with the model where $n > m$.)

The  techniques  used  here  provide  elementary  arguments  matching  the  existing  bounds  when 
m  =  2,  and  generalising  to  m  >  2.  However  they  are  weak  for  these  larger  m.  We  give  a  coupling 
for the general process, and give bounds on the time to stationarity in the two-urn and many-urn
cases. 


2.2.1  A  Coupling  for  Bernoulli-Laplace  Processes 

Let Z be a Bernoulli-Laplace-$(n, m, \beta)$ process. We describe a coupling (X, Y) for Z.

1. Let $\beta = (\beta_1, \beta_2, \ldots, \beta_m)$. Start X with the first $\beta_1$ balls in the first urn, the next $\beta_2$ balls
in the second urn, and so forth. Start Y in a state chosen uniformly from its possible states.
Call a ball in either process coupled to its counterpart in the other process if both are in
corresponding urns in the two processes.

2. Choose two urns {u, v} uniformly at random from the $\binom{m}{2}$ such pairs. Choose the corresponding
two urns in process Y.

3. Choose a ball a at random from urn u in process X. If ball a is coupled, choose the same ball
a' = a from urn u in process Y. If ball a is uncoupled, choose a ball a' uniformly from the
uncoupled balls in urn u of process Y.

4. Choose a ball b at random from urn v in process X. If ball b is coupled, choose the same ball
b' = b from urn v in process Y. If ball b is uncoupled, choose a ball b' uniformly from the
uncoupled balls in urn v of process Y.

5. Swap balls a and b in process X. Swap balls a' and b' in process Y.

6.  Let  T  be  the  first  time  that  all  balls  are  coupled. 
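
The six steps above can be sketched in code as follows. The representation is an assumption made for illustration: each process is a list mapping ball index to urn index, with urns numbered from 0 rather than 1, and all function names are hypothetical. A uniform start for Y is obtained by shuffling the list of urn labels.

```python
import random
from itertools import combinations


def initial_state(beta):
    """Ball -> urn list: the first beta[0] balls in urn 0, and so on."""
    return [u for u, size in enumerate(beta) for _ in range(size)]


def coupled_step(X, Y, m, rng):
    """One step of the coupling, mutating X and Y in place."""
    u, v = rng.choice(list(combinations(range(m), 2)))      # step 2

    def pick(urn):
        # Steps 3-4: choose a ball in X, then its partner in Y.
        a = rng.choice([i for i in range(len(X)) if X[i] == urn])
        if Y[a] == urn:                                      # coupled ball
            return a, a
        uncoupled = [i for i in range(len(Y)) if Y[i] == urn and X[i] != urn]
        return a, rng.choice(uncoupled)

    a, a2 = pick(u)
    b, b2 = pick(v)
    X[a], X[b] = v, u                                        # step 5
    Y[a2], Y[b2] = v, u


def run_coupling(beta, rng, max_steps=100_000):
    """Return the first time all balls are coupled (step 6), or None if
    that has not happened within max_steps."""
    X = initial_state(beta)
    Y = X[:]
    rng.shuffle(Y)        # uniform arrangement with the same urn sizes
    for t in range(max_steps + 1):
        if X == Y:
            return t
        coupled_step(X, Y, len(beta), rng)
    return None
```

Both picks are made before either swap, matching the order of steps 3 to 5. The list of uncoupled balls in an urn of Y is never empty when needed, because an urn holds the same number of uncoupled balls in X and in Y.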

Theorem 2.9 The procedure above gives a proper coupling (X, Y) for the $(n, m, \beta)$ Bernoulli-
Laplace process, for $n > m \ge 2$.

Proof: It is clear that the process X is identical to Z, and that Y starts in the stationary
distribution of Z. We must show that the transition matrix of Y is identical to Z, and that Y
eventually meets X with probability 1, coinciding thereafter.

To see that Y moves with the same transition probabilities as Z, notice that (a) the urns are
chosen with the same probabilities as for Z, and (b) when a ball is chosen from an urn, either it was
coupled and the corresponding ball was chosen in X, or it was uncoupled and was chosen uniformly
from amongst the uncoupled balls in the same urn in Y. Thus, a ball within a chosen urn u is picked
with probability $\frac{1}{\beta_u}$ if it is coupled, while if it is not coupled it is chosen with probability $\frac{k}{\beta_u} \cdot \frac{1}{k} = \frac{1}{\beta_u}$,
where k is the number of uncoupled balls in urn u. In both cases the ball is chosen with the correct
probability.

To  see  that  Y  eventually  meets  X  with  probability  1,  first  notice  that  once  a  ball  is  coupled  it 
remains coupled, and, as we show in the next paragraph, there is always a finite sequence of moves,
having  nonzero  probability,  that  reduces  the  number  of  uncoupled  balls.  It  follows  that  there  is 
a  finite  sequence  of  moves,  with  nonzero  probability,  making  X  and  Y  coincide.  Therefore  they 
eventually  meet  with  probability  1. 

Assume for the moment that all $\beta_i$ are at least 2. Let a be an uncoupled ball in some urn u of
X. Then, by the definition of a “coupled” ball, a resides in some other urn, v, of Y. Since all $\beta_i$
are at least 2, there is at least one other ball b' in urn v of Y. If b' is coupled, let b be its mate in
urn v of X, otherwise let b be any uncoupled ball in urn v of X. Let a' be any uncoupled ball in
urn u of Y. (There must be at least one, since there is one uncoupled ball, namely a, in urn u of
X.) Now the move that swaps a and b in X, and a' and b' in Y, couples the previously uncoupled
balls labelled a (and possibly others), while leaving coupled all previously coupled balls. A similar
sequence of two moves works even when only $\beta_1 \ge 2$ (guaranteed by $n > m$), by first getting a to
urn 1 in Y, and then doing the move just described. □

2.2.2  Tight  Analysis  of  the  Two-Urn  Case 

The  time  to  stationarity  in  the  two-urn  case  is  easy  to  analyse.  A  “coupon  collector”  argument 
yields  the  following  bound. 

Theorem 2.10 For the $(n, 2, \beta)$ Bernoulli-Laplace process, with $n > 2$, $\beta_1 \ge \beta_2$, we have

\[ \|\pi_k - \pi\| \le e^{-c} \quad \text{whenever } k \ge \frac{\beta_1\beta_2}{n-2}\left(\ln(\beta_1\beta_2/n) + c\right). \]

Proof: In the coupling described before, let $E_{i,k}$ be the event that the ith ball is still uncoupled
after the kth step. The event $\{T > k\}$ is precisely the union $\bigcup_i E_{i,k}$, where the union is taken over
all of the balls i in the second urn of process X after the kth step. So $\Pr\{T > k\} \le \sum_i \Pr\{E_{i,k}\} =
\beta_2 \Pr\{E_{i_0,k}\}$, where $i_0$ is any one ball in the second urn. At time $k = 0$ we have $\Pr\{E_{i_0,0}\} = \beta_1/n$.
Now consider that, in any given step, a given uncoupled ball (necessarily in urn 1 in one of the two
processes) can get coupled in one of 2 ways: Either it stays fixed and its counterpart in the other
process moves into the corresponding urn, which occurs with probability $\frac{\beta_1 - 1}{\beta_1} \times \frac{1}{\beta_2}$, or it moves into
the other urn and its counterpart in the other process stays fixed, which occurs with probability
$\frac{1}{\beta_1} \times \frac{\beta_2 - 1}{\beta_2}$. We thus see that a given uncoupled ball has probability exactly $p = \frac{n-2}{\beta_1\beta_2}$
of being coupled at any step, and thus $\Pr\{E_{i_0,k}\} \le \frac{\beta_1}{n}(1 - p)^k$. So $\Pr\{T > k\} \le \frac{\beta_1\beta_2}{n}(1 - p)^k$.
Writing this in terms of the exponential, using the Taylor expansion for the logarithm, and applying
the Coupling Inequality then yields the stated bound. □
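
The arithmetic of this proof can be reproduced directly. The sketch below computes the per-step coupling probability as the sum of the two cases described above, and the resulting number of steps; it assumes the threshold has the form $k \ge \frac{\beta_1\beta_2}{n-2}(\ln(\beta_1\beta_2/n) + c)$, and the function names are illustrative.

```python
from fractions import Fraction
from math import ceil, log


def two_urn_p(b1, b2):
    """Per-step coupling probability for an uncoupled ball (two urns)."""
    stays_and_partner_moves = Fraction(b1 - 1, b1) * Fraction(1, b2)
    moves_and_partner_stays = Fraction(1, b1) * Fraction(b2 - 1, b2)
    return stays_and_partner_moves + moves_and_partner_stays


def two_urn_steps(b1, b2, c):
    """Smallest integer k >= (b1*b2/(n-2)) * (ln(b1*b2/n) + c)."""
    n = b1 + b2
    return max(0, ceil(b1 * b2 / (n - 2) * (log(b1 * b2 / n) + c)))
```

For equal urns of 50 balls each, `two_urn_p(50, 50)` is 98/2500, and `two_urn_steps(50, 50, 2.0)` is roughly a quarter of $n(\ln n - \ln 4 + c)$ with n = 100.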

Remark: The case when $\beta_1 = \beta_2 = \frac{n}{2}$ gives the largest value of this bound on the time to
stationarity. In this case, we have total variation distance not exceeding $e^{-c}$ after $k \ge \frac{n^2}{4(n-2)}(\ln n -
\ln 4 + c)$, or about $\frac{1}{4}n(\ln n + c)$ steps. This matches, within a factor of 2, the bound obtained by
Diaconis and Shahshahani in [DS87], but our argument is elementary. Their bound is obtained using
representation theory which we discuss briefly later. The properties they rely on no longer hold for
the case of $n > m \ge 3$. We can give some analysis of the coupling, and we give some new geometric
techniques based on the underlying graph. □

2.2.3  Weak  Analysis  of  the  General  Case 

The preceding argument generalises easily to handle more than two urns each containing at least
two balls. In this situation, each uncoupled ball has some chance of being coupled in the next step,
as in the two-urn case.


Theorem 2.11 Let Z be an $(n, m, \beta)$ Bernoulli-Laplace process, with $n > 2$, $\beta_1 \ge \beta_2 \ge \cdots \ge \beta_m \ge
2$. Let $\pi_k$ denote the distribution of the state of Z after the kth step, let $\pi$ denote the stationary
distribution, and let $\|\pi_k - \pi\|$ denote the total variation distance between them. Then we have

\[ \|\pi_k - \pi\| \le e^{-c} \quad \text{whenever } k \ge \frac{\binom{m}{2}\beta_1\beta_2}{\beta_1 + \beta_2 - 2}\left(\ln(n - \beta_1) + c\right). \]


Proof: Again let $E_{i,k}$ be the event that the ith ball is still uncoupled after the kth step. Let $I_k$ be
the set of all balls in process X, except those that lie in urn 1 immediately after the kth step of the
coupling. Note that for all k, $|I_k| = n - \beta_1$. The event $\{T > k\}$ is precisely the union $\bigcup_{i \in I_k} E_{i,k}$,
since if after step k all balls in the set $I_k$ are coupled, then necessarily those in the first urn are
coupled as well. So $\Pr\{T > k\} \le \sum_{i \in I_k} \Pr\{E_{i,k}\} \le (n - \beta_1)(1 - p)^k$, where p is a lower bound
on the probability that a given uncoupled ball becomes coupled in a single step of the process. To
obtain a value for p, notice that in any given step, a given uncoupled ball b can become coupled if
the following sequence of events occurs.

1. A pair of urns {u, v} where urn u contains the ball b in one process and urn v contains the
counterpart ball b in the other process is chosen. This happens with probability $1/\binom{m}{2}$.

2. Then either the ball in u is chosen to move into the urn v (probability $1/\beta_u$), while the
counterpart remains fixed (probability $(\beta_v - 1)/\beta_v$), or alternatively, the ball in u stays fixed
(probability $(\beta_u - 1)/\beta_u$), and the counterpart in v moves to u (probability $1/\beta_v$).

Thus an uncoupled ball has probability at least

\[ p = \min_{u,v} \frac{\beta_u + \beta_v - 2}{\binom{m}{2}\beta_u\beta_v} = \frac{\beta_1 + \beta_2 - 2}{\binom{m}{2}\beta_1\beta_2} \]

of being coupled at any step. Using this value of p with $\Pr\{T > k\} \le (n - \beta_1)(1 - p)^k$ yields the
stated bound. □
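
That the minimum over pairs of urns is attained at the two largest urns can be confirmed by brute force for any particular $\beta$. The sketch below uses exact fractions; the function names are hypothetical, and the formula for a single pair is the one read off from the two numbered steps above.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def pair_probability(bu, bv, m):
    """(bu + bv - 2) / (C(m, 2) * bu * bv): the one-step coupling
    probability computed in the proof for urn sizes bu and bv."""
    return Fraction(bu + bv - 2, comb(m, 2) * bu * bv)


def p_lower_bound(beta):
    """Minimum of pair_probability over all pairs of urns."""
    m = len(beta)
    return min(pair_probability(bu, bv, m) for bu, bv in combinations(beta, 2))
```

For $\beta = (7, 5, 4, 2)$ the minimum is 1/21, attained by the pair of sizes 7 and 5. For three equal urns of 100 balls, 1/p is about 151.5, close to n/2 = 150.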


Example 2.12 For the case of n balls distributed evenly in $m = 3$ urns, the theorem says that
approximately $\frac{1}{2} n \ln n$ swaps are sufficient to get near uniformity. More generally, for n balls
distributed evenly in any fixed number m of urns, $O(n \ln n)$ steps are sufficient to get within any fixed
variation from uniformity. But if m is allowed to increase, this bound grows as $m^2$. This is wrong,
but seems inherent in this line of analysis. □

A  similar,  but  even  weaker,  argument  can  be  used  to  give  bounds  when  one  of  the  urns  contains  a 
single  ball.  When  more  than  one  urn  contains  only  one  ball,  a  slightly  different  argument  is  needed. 
For in this case, uncoupled balls that reside in one of the single-ball urns and whose counterparts
in the other process are also in single-ball urns have no chance of getting coupled in a single step.
They  do  have  a  chance,  however,  of  getting  coupled  in  two  steps  by  first  moving  (or  having  their 
counterparts  move)  to  an  urn  with  many  balls,  and  then  coupling  in  one  more  step.  Letting  l 


denote  the  number  of  urns  containing  at  least  two  balls,  the  probability  of  this  two-step  event  can 
be  bounded  below  by  & p,  where  p  is  that  defined  in  the  proof  above.  This  places  a  multiplicative 

U; 

factor  of  on  the  time  to  stationarity  obtained  above.  Clearly  this  analysis  is  not  very  tight,  since 
putting  l  =  m  does  not  return  us  the  bound  of  Theorem  2.11,  but  instead  sacrifices  a  large  factor 
of  about  y.  A  significantly  tighter  analysis  using  this  coupling  is  not  evident. 


2.3  Markov  Chains  based  on  Groups 

In  the  remainder  of  this  chapter  we  explore  random  walks  based  on  groups  and  group  actions,  which 
is  the  natural  setting  in  which  to  work  more  generally  with  urn  models.  We  will  use  the  Bernoulli- 
Laplace  model  as  a  running  example.  Before  proceeding,  we  need  to  introduce  a  substantial  amount 
of  terminology. 

2.3.1  Transitive  Group  Actions 

We  assume  the  reader  is  familiar  with  the  basic  properties  of  a  group.  Otherwise  the  reader  may 
refer  to  Hungerford’s  book  [Hun74].  A  more  elementary  text  will  do.  Here  when  we  speak  of  a 
group, we will mean a finite group.

Let X be a finite set. A (left) action of a group $\Gamma$ on X is a function $\alpha$ mapping $\Gamma \times X$ to X
such that for all $x \in X$, $g_1, g_2 \in \Gamma$

\[ \alpha(\mathrm{id}, x) = x \quad \text{and} \quad \alpha(g_2, \alpha(g_1, x)) = \alpha(g_2 g_1, x), \]

where id denotes the identity element. It is natural to omit $\alpha$ and write simply $gx$ for $g \in \Gamma$, $x \in X$.
This notation is ambiguous; the same group may have several different actions on a set. But in
context, it poses no problems.

Given an action, and elements $x, y \in X$, say that x may be mapped to y, if $gx = y$ for some
$g \in \Gamma$. This is an equivalence relation on X, and the equivalence classes are called the orbits of the
action. The equivalence class containing x is called the orbit of x. If all elements may be mapped
to each other, that is, if the action has only one orbit, the action is called transitive. From this
point on, we restrict our attention to transitive actions.

Example 2.13 A group $\Gamma$ acts transitively on itself through its multiplication map: Let $X = \Gamma$
and let $\alpha(g, x)$ be the group element $gx$ obtained by multiplying g and x. This action is called left
translation, and the corresponding right action of $\Gamma$ is called right translation. □




Example 2.14 We get a transitive action of the symmetric group $S_n$ on [n] by defining for $\sigma \in S_n$,
$x \in [n]$, $\sigma x = \sigma(x)$, the image of the element x under the permutation $\sigma$. □

Example 2.15 The symmetric group $S_n$ acts “elementwise” on the set of all k-element subsets of
[n]. For a k-set $x = \{x_1, x_2, \ldots, x_k\}$ and $\sigma \in S_n$, define

\[ \sigma x = \{\sigma(x_1), \sigma(x_2), \ldots, \sigma(x_k)\}, \]

the set of the images of the elements of x under the permutation $\sigma$. This gives a transitive action.
□

Remark: If Γ acts on X, consider for fixed g ∈ Γ the function g(x) = gx. This is one-to-one, for if
gx_1 = gx_2 then x_1 = x_2; it is also a surjection since g(g^{-1}y) = y. This allows one to view a
group action as a homomorphism of Γ into the group of permutations of X. □

If x and y are any two elements of X, let Γ_{xy} denote the (non-empty) set of elements mapping
x to y. That is, Γ_{xy} = {g | gx = y}.

A  group  element  g  such  that  gx  =  x  is  said  to  fix  x.  The  set  of  all  elements  that  fix  x  forms  a 
subgroup of Γ called the stabilizer or isotropy subgroup of x and denoted Stab(x). Note that
Stab(x) = Γ_{xx}.

Lemma 2.16 Let Γ act transitively on X. For given x, y, let g_0 ∈ Γ_{xy} be some designated element
mapping x to y. Then every element in Γ_{xy} may be written uniquely as g_0 n for some n ∈ Stab(x).

Proof: Define the function f(n) = g_0 n. We wish to show this gives a bijection of Stab(x) onto Γ_{xy}.
It is clearly a one-one function by cancellation. Moreover, the image of Stab(x) is clearly contained
in Γ_{xy}, since (g_0 n)x = g_0 x = y.

It remains only to show that every element in Γ_{xy} is the image f(n) of some element n ∈
Stab(x). Suppose g ∈ Γ_{xy}, so that gx = y. Then

$$(g_0^{-1} g)x = g_0^{-1}(gx) = g_0^{-1} y = g_0^{-1}(g_0 x) = (g_0^{-1} g_0)x = x.$$

So n = g_0^{-1} g is in Stab(x), and f(n) = g_0 n = g. ∎

Corollary 2.17 It follows that for every x, y ∈ X

$$|\Gamma_{xy}| = |\mathrm{Stab}(x)| = \frac{|\Gamma|}{|X|}.$$

Note that the latter is independent of x and y.




Proof: The first equality is immediate from Lemma 2.16. For the second, fix an x ∈ X. Clearly
Γ = ∪_{y ∈ X} Γ_{xy} (since every g maps x to some y), and the Γ_{xy} are disjoint (since g cannot map x to
two different y's). From the first equality, each Γ_{xy} has size |Stab(x)|, so |Γ| = |X| |Stab(x)|. ∎

Remark: The latter equality is a special case of a more general theorem,

$$|\mathrm{Stab}(x)| = \frac{|\Gamma|}{|\mathrm{Orb}(x)|},$$

where Orb(x) denotes the orbit of x. For a proof, see [Hun74, Theorem 4.3]. □
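The counting in Corollary 2.17 can be checked by brute force on a small case. The following is a minimal sketch, assuming the elementwise action of S_5 on 2-element subsets of {0, ..., 4} from Example 2.15; the names `act`, `group`, `fibers`, and the choice n = 5, k = 2 are illustrative and not from the text.

```python
from itertools import combinations, permutations

def act(sigma, subset):
    """Elementwise action of a permutation (given as a tuple) on a frozenset."""
    return frozenset(sigma[i] for i in subset)

n, k = 5, 2
group = list(permutations(range(n)))                 # S_n as tuples
X = [frozenset(c) for c in combinations(range(n), k)]
x = X[0]

# Stab(x): all group elements fixing x
stab = [g for g in group if act(g, x) == x]

# Gamma_{xy}: for each y, the elements mapping x to y
fibers = {y: [g for g in group if act(g, x) == y] for y in X}
fiber_sizes = {len(f) for f in fibers.values()}

# Every fiber has the same size as the stabilizer, namely |Gamma| / |X|
print(len(stab), fiber_sizes, len(group) // len(X))
```

Here the stabilizer is isomorphic to S_2 × S_3, so each of the ten sets Γ_{xy} should contain 12 of the 120 permutations.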

Here is another useful way of viewing things that we shall use. Designate an element x_0 ∈ X
and let N = Stab(x_0) be the isotropy subgroup. For y ∈ X, the sets Γ_{x_0 y} are precisely the cosets of
Stab(x_0) in Γ. If g maps x_0 to y then every element of gN maps x_0 to y. If g and h both map x_0 to
y then hN = gN.

If we designate an element g_y in each distinct coset Γ_{x_0 y}, we get a system of coset
representatives. This gives a mapping of X to the coset space Γ/N that respects the action of Γ: the coset
corresponding to x_0 is N itself, and in general for y ∈ X the corresponding coset is g_y N = Γ_{x_0 y}.
Every element of Γ can be written uniquely as g_y n for some y and some n ∈ N, and for any x, y,
every element of Γ_{xy} can be written g_y n g_x^{-1} for some n ∈ N.

All of the isotropy subgroups are conjugate, since Stab(y) = g_y Stab(x_0) g_y^{-1}. This means our
choice of x_0 was not important, since this conjugacy gives an isomorphism of the group.


2.3.2  Cayley  Graphs 

Let Γ act transitively on X. Let S be a symmetric set of generators for Γ: S generates Γ and
s ∈ S implies s^{-1} ∈ S. The simple graph G = (X, E) whose vertex set is X, and whose edges
are E = {{x, y} | x, y ∈ X, x ≠ y, y = sx for some s ∈ S}, is the (Cayley) graph of the group
action with respect to the generators S, and we denote it Cayley(Γ, X, S), where the action is
understood from context. This graph is connected since the action is transitive. In the special case
when X = Γ, the group itself, and the action is given by translation, the graph Cayley(Γ, Γ, S) is
known simply as the Cayley graph of the group (with respect to the generators S).

Example 2.18 Let Γ be S_n. Let S be the set of all transpositions. Then Cayley(Γ, Γ, S) is the
graph with a vertex for each arrangement of [n] = {1, 2, ..., n} with an edge between two vertices
if their arrangements differ by one swap. Fix a k and let X be the set of k-sets of [n]. Let Γ act
elementwise on the k-sets. Then Cayley(Γ, X, S) is the graph with a vertex for each of the C(n, k) k-sets
of [n] and an edge between two vertices if the symmetric difference of the corresponding k-sets has
cardinality 2. □
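As a small sanity check of the second graph in this example, one can build the edge set directly from the definition (y = sx for a transposition s) and compare it with the symmetric-difference description. This is a sketch only; the helper `swap` and the values n = 5, k = 2 are assumptions made for illustration.

```python
from itertools import combinations

n, k = 5, 2
X = [frozenset(c) for c in combinations(range(1, n + 1), k)]
transpositions = list(combinations(range(1, n + 1), 2))

def swap(x, i, j):
    """Apply the transposition (i j) elementwise to the set x."""
    return frozenset(j if a == i else i if a == j else a for a in x)

# Edges from the definition: {x, sx} for each transposition s with sx != x
edges = {frozenset((x, swap(x, i, j)))
         for x in X for (i, j) in transpositions
         if swap(x, i, j) != x}

# Edges from the description: k-sets whose symmetric difference has size 2
expected = {frozenset((x, y))
            for x, y in combinations(X, 2) if len(x ^ y) == 2}

print(len(X), len(edges), edges == expected)
```

Each k-set has k(n − k) neighbours, so for n = 5 and k = 2 the graph has 10 vertices and 30 edges.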




2.3.3  Vertex-Transitive  Graphs 


If G = (V, E) is an undirected simple graph, its automorphism group, denoted Aut(G), is the set of
all permutations g of the vertices such that g(E) = E, where g(E) denotes the set composed of the
elements {g(v), g(w)} for each {v, w} ∈ E.

The graph G is called vertex-transitive if Aut(G) acts transitively on the vertices, or in other
words, if for every pair of vertices v, w ∈ V there is an automorphism g ∈ Aut(G) such that g(v) = w.
Intuitively, this means that the graph "looks the same" from every vertex.

It is well known that Cayley graphs of groups are always vertex-transitive [Big74]. We include a
proof for completeness.


Theorem 2.19 If Γ is any group and S is any symmetric set of generators of Γ, then G =
Cayley(Γ, Γ, S) is vertex-transitive.


Proof: Γ acts transitively on itself by right translation x ↦ xg for x ∈ Γ, and it is easy to see that
each g ∈ Γ gives an automorphism of the graph by this map. Let G = (Γ, E) = Cayley(Γ, Γ, S).
Then

$$\begin{aligned}
(x, y) \in E &\iff y x^{-1} \in S \\
&\iff y g g^{-1} x^{-1} \in S \\
&\iff (xg, yg) \in E.
\end{aligned}$$

It follows that the permutations of Γ given by x ↦ xg are all in Aut(G), and since these translations
act transitively on Γ, so does Aut(G). ∎


Example 2.20 The same does not hold generally for Cayley graphs of group actions: they
are not always vertex-transitive. Take Γ = S_4, and let X be the set of 2-element
subsets of [4]. For generators take S = {(12), (234), (432)}. The graph has C(4, 2) = 6 vertices so that it
can easily be drawn. One can see by inspection that Cayley(Γ, X, S) has no automorphism mapping
the vertex for the set {1, 2}, which has degree 2, to the vertex for the set {1, 3}, which has degree 3.

It  is  possible  to  define  Cayley  graphs  of  group  actions  as  graphs  with  self-loops  and  multiple 
edges  so  that  they  are,  at  least,  regular.  One  can  check  that  the  preceding  example  is  still  not 
vertex-transitive  under  that  definition.  □ 
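The degrees quoted in this example can be recomputed mechanically. The sketch below builds the simple graph Cayley(S_4, X, S) for the three generators named above and tabulates vertex degrees; the helper `cycle_to_map` and the variable names are illustrative assumptions.

```python
from itertools import combinations

def cycle_to_map(cycle, n=4):
    """Turn a cycle such as (2, 3, 4) into a dict i -> image on {1, ..., n}."""
    m = {i: i for i in range(1, n + 1)}
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        m[a] = b
    return m

# Generators (12), (234), (432) acting on 2-element subsets of {1, 2, 3, 4}
S = [cycle_to_map(c) for c in [(1, 2), (2, 3, 4), (4, 3, 2)]]
X = [frozenset(c) for c in combinations(range(1, 5), 2)]

edges = set()
for x in X:
    for s in S:
        y = frozenset(s[i] for i in x)
        if y != x:                      # simple graph: no self-loops
            edges.add(frozenset((x, y)))

degree = {x: sum(1 for e in edges if x in e) for x in X}
for x in X:
    print(sorted(x), degree[x])
```

Since an automorphism preserves degrees, two vertices of different degree cannot be mapped to one another, which is the obstruction the example points to.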

Cayley graphs of group actions are vertex-transitive only in certain cases. For example, if Γ acts
transitively on X and the isotropy subgroup N is a normal subgroup of Γ, then Cayley(Γ, X, S)
is isomorphic to Cayley(Γ/N, Γ/N, S′), the graph of the quotient group Γ/N under the generators
S′ = ∪_{s ∈ S} sN. A more interesting situation is the following.




Theorem 2.21 Let Γ act transitively on X, and let S be a symmetric set of generators for Γ. If S
is closed under conjugacy then G = Cayley(Γ, X, S) is vertex-transitive.

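Proof: Let x ↦ gx be the action of Γ on X. We know this is a transitive action. We show that
every g ∈ Γ gives an automorphism of G = (X, E) through this action.

$$\begin{aligned}
(x, y) \in E &\iff y \in Sx \text{ and } x \ne y \\
&\iff y \in g^{-1} S g x \text{ and } x \ne y \\
&\iff gy \in S g x \text{ and } x \ne y \\
&\iff (gx, gy) \in E \text{ and } x \ne y \\
&\iff (gx, gy) \in E \text{ and } gx \ne gy.
\end{aligned}$$

The second equivalence uses the fact that S is closed under conjugacy, so that g^{-1} S g = S. ∎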


Example 2.22 In S_n, the transpositions form a single conjugacy class. Thus the graph of k-sets
described in Example 2.18 is vertex-transitive. □

2.3.4  Chains  Based  on  Groups 

A Markov chain is a random walk on a group Γ if its transition matrix P has entries given by

$$P_{x,y} = p(y x^{-1}) \qquad (\forall x, y \in \Gamma)$$

where p is some probability distribution on the group Γ. The distribution p is called the transition
measure.

The chain P is ergodic, provided that the support of the transition measure Supp(p) = {g |
p(g) > 0} generates Γ. Moreover, it is easy to see that the transition matrix P of any random
walk on a group is doubly stochastic. This means that its stationary distribution is the uniform
distribution on Γ.
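To make the definition concrete, here is a minimal sketch that builds P_{x,y} = p(yx^{-1}) on S_3 and checks that every row and every column sums to 1. The particular measure (mass 1/n on the identity, 2/n² on each transposition) is borrowed from the random transpositions walk used later in this section; the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations

def compose(a, b):
    """Return the permutation a∘b (apply b first, then a)."""
    return tuple(a[i] for i in b)

def inverse(a):
    inv = [0] * len(a)
    for i, ai in enumerate(a):
        inv[ai] = i
    return tuple(inv)

n = 3
G = list(permutations(range(n)))

def p(g):
    """Transition measure: 1/n on the identity, 2/n^2 on transpositions."""
    moved = sum(1 for i in range(n) if g[i] != i)
    if moved == 0:
        return Fraction(1, n)
    if moved == 2:
        return Fraction(2, n * n)
    return Fraction(0)

# P[x][y] = p(y x^{-1})
P = {x: {y: p(compose(y, inverse(x))) for y in G} for x in G}

row_sums = [sum(P[x].values()) for x in G]
col_sums = [sum(P[x][y] for x in G) for y in G]
symmetric = all(P[x][y] == P[y][x] for x in G for y in G)
print(row_sums, col_sums, symmetric)
```

Because this particular p satisfies p(g) = p(g^{-1}), the resulting matrix is also symmetric.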

Because the stationary distribution is uniform, P will be time-reversible precisely if P is
symmetric. This can be stated in terms of the transition measure: P is time-reversible, and in fact
symmetric, iff p(g) = p(g^{-1}). In this case, we will call P a symmetric random walk on Γ.

If the group Γ acts transitively on a finite set X, any random walk P on Γ induces a Markov
chain P′ on X as follows. Fix an element x_0 ∈ X. When the walk P is in state g ∈ Γ, we specify
that the Markov chain P′ is in state gx_0 ∈ X. It is easy to see that if x_k is the state of the chain
P′ at time k, then the next state is determined by x_{k+1} = g x_k where g ∈ Γ is chosen according to
the transition measure p of the original walk P. We say that P′ is P induced directly to X. The
entries of the transition matrix P′ are:

$$P'_{x,y} = p(g_y N g_x^{-1}),$$

where N = Stab(x_0). The underlying graph of any symmetric random walk on Γ induced directly to
X is the Cayley graph of the group's action. We can and will use this to our advantage in bounding
eigenvalues.

For a certain class of transition measures, there is another way to induce a walk on X through
a group's action. This alternate method is less direct but yields a chain with greater symmetry. A
measure p (or more generally any function on Γ) is called N-bi-invariant if for every g ∈ Γ and for
all n_1, n_2 ∈ N,

$$p(n_1 g n_2) = p(g).$$

Again, fix an x_0 ∈ X and let N = Stab(x_0). If P is a random walk on Γ given by an N-bi-invariant
transition measure p, then we can define a Markov chain P̃ on X by

$$\tilde P_{x,y} \stackrel{\text{def}}{=} p(g_x^{-1} g_y N). \qquad (2.3)$$

This is well-defined, and does not depend on our choice of coset representatives, since any elements
of the cosets g_x N and g_y N can be represented as g_x n_1, g_y n_2 for n_1, n_2 ∈ N, whence
we have

$$p((g_x n_1)^{-1} (g_y n_2) N) = p(n_1^{-1} g_x^{-1} g_y N) = p(g_x^{-1} g_y N),$$

by the bi-invariance of p.

We say that P̃ is P induced symmetrically to X. The chain has the following intuitive
description. At each step, we imagine we are at x_0, and we first choose a new position as if we were
at x_0, i.e., we choose a position gx_0, choosing g ∈ Γ according to p(g). Then, using the action of g_x,
we translate this whole move (x_0, gx_0) into a move from x to y = g_x g x_0. It is easy to see that to
move from x to a given element y it is necessary and sufficient that g ∈ g_x^{-1} g_y N.

The chain satisfies the following symmetry condition,

$$\tilde P_{gx, gy} = \tilde P_{x,y} \qquad \forall g \in \Gamma, \qquad (2.4)$$

and it is easily verified that any Markov chain P̃_{x,y} on X = Γ/N that satisfies this condition is
generated by an N-bi-invariant transition measure given by p(g) = P̃_{x_0, g x_0}.

Theorem 2.23 Let Γ act transitively on X with isotropy subgroup N = Stab(x_0). Let p be a
transition measure on Γ that is constant on conjugacy classes. The chain P′ induced directly to X
is the same as the chain Q induced symmetrically to X starting from the N-bi-invariant transition
measure

$$q(g) = \frac{1}{|N|}\, p(gN).$$

Proof: First, let us see that Q is well defined by showing that q(g) is indeed N-bi-invariant.
Right-invariance is clear. On the left, for n ∈ N we have

$$q(ng) = \frac{1}{|N|}\, p(ngN) = \frac{1}{|N|}\, p(gNn) = q(g),$$

since p is constant under conjugacy and N is a subgroup.

Now we want to verify that Q_{x,y} = P′_{x,y} for x, y ∈ X. To do this, we first show that the
chain P′ satisfies the symmetry condition (2.4). Then, by that symmetry, it will suffice to consider
only x_0 and show that P′_{x_0,y} = Q_{x_0,y} for all y ∈ X.

The chain P′ satisfies the symmetry condition because p is constant on conjugacy classes.
Specifically, we have

$$P'_{gx, gy} = p((g g_y) N (g g_x)^{-1}) = p(g\, g_y N g_x^{-1} g^{-1}) = p(g_y N g_x^{-1}) = P'_{x,y}$$

for every g ∈ Γ.

Now, restricting our attention to x_0, we have

$$P'_{x_0, y} = p(g_y N g_{x_0}^{-1}) = p(g_y N) = q(g_y N) = q(g_{x_0}^{-1} g_y N) = Q_{x_0, y}.$$

Here in the second equality, we have used the fact that g_{x_0} is in N. In the fourth equality, the same
fact was combined with the N-invariance of q. ∎

Example 2.24 Let P be the random walk on Γ = S_n generated using p(id) = 1/n, p(g) = 2/n² if
g is a transposition and p(g) = 0 otherwise. Let X be the k-sets of [n], and let x_0 = {1, 2, ..., k}. The
group Γ acts transitively on X elementwise; the isotropy subgroup N = Stab(x_0) is isomorphic to
S_k × S_{n−k}. The directly induced walk is P′, whose nonzero entries are P′_{x,y} = 2/n² if the symmetric
difference of x and y has cardinality 2, and P′_{x,x} = 1 − (2k(n − k)/n²) for each x. The underlying
graph is Cayley(Γ, X, S), where S is the set of transpositions.

The theorem says this is the same as the chain Q induced symmetrically from the N-bi-invariant
transition measure q(g) = (1/|N|) p(gN). For example,

$$Q_{x,x} = q(N) = \frac{1}{n} + \binom{k}{2}\frac{2}{n^2} + \binom{n-k}{2}\frac{2}{n^2} = 1 - \frac{2k(n-k)}{n^2},$$

as required.

Note that this chain is not quite the same as the Bernoulli-Laplace model; it differs in the holding
probabilities. The Bernoulli-Laplace model corresponds to the walk induced symmetrically from the
N-bi-invariant measure determined by q(gN) = 1/(k(n − k)) if gN is a coset corresponding to a set within
a single exchange of x_0. □
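The entries of the directly induced walk in this example can be reproduced numerically. The following sketch, under the assumption n = 5 and k = 2 (any small values would do), accumulates P′_{x,y} as the total p-mass of the permutations carrying x to y and compares the result with the closed forms stated above. All identifiers are illustrative.

```python
from fractions import Fraction
from itertools import combinations, permutations

n, k = 5, 2
G = list(permutations(range(n)))
X = [frozenset(c) for c in combinations(range(n), k)]

def p(g):
    """Random transpositions measure: 1/n on id, 2/n^2 on each transposition."""
    moved = sum(1 for i in range(n) if g[i] != i)
    if moved == 0:
        return Fraction(1, n)
    if moved == 2:
        return Fraction(2, n * n)
    return Fraction(0)

def act(g, x):
    return frozenset(g[i] for i in x)

# Directly induced chain: from x, draw g from p and move to gx
P_direct = {x: {y: Fraction(0) for y in X} for x in X}
for g in G:
    w = p(g)
    if w:
        for x in X:
            P_direct[x][act(g, x)] += w

holding = 1 - Fraction(2 * k * (n - k), n * n)
print(P_direct[X[0]][X[0]], holding)
```

For n = 5 and k = 2 the holding probability works out to 13/25, and each of the 6 neighbours of a 2-set is reached with probability 2/25.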

Remark: This gives a partial answer to a question posed by Diaconis and Shahshahani [DS87,
p. 213]: when can the directly induced walk from a given transition measure p be viewed as the
symmetrically induced walk of some N-bi-invariant measure q? They give techniques for dealing
with the latter in some cases (which we discuss briefly in the next section). This gives a class of
cases in which the same techniques will apply to the directly induced walk. □

The following should be noted. A coupling for a random walk on Γ induces couplings for the
induced walks in the obvious way (both in the direct and symmetric case). When the coupling for
Γ succeeds, so does the induced coupling. So a maximal coupling for Γ induces a coupling for the
induced walk that converges at least as fast. This gives us the following simple theorem. (The
theorem also has other simple proofs in terms of properties of variation distance.)

Theorem  2.25  Both  the  direct  and  symmetrically  induced  walks  converge  in  variation  distance  at 
least  as  fast  as  the  random  walk  on  the  group. 

2.3.5  On  the  Harmonic  Analysis  Approach 

Diaconis [Dia88] has applied representation theory to give good bounds on the time to stationarity
for a number of random walks on groups. The advantage of such results is that they typically give
bounds on the full spectrum, and lead to sharp convergence rate bounds; in many cases matching
lower bounds on the convergence rate can be found. Similar techniques have been used by Flatto,
Odlyzko, and Wales [FOW85] to analyze the time to hit a particular state, and by Matthews [Mat85]
to analyze the time to hit all states (covering time).

The idea behind the harmonic analysis approach can be summarized as follows. For background,
the reader should consult [Dia88]. If P is a random walk on a group, the kth-step distribution
π_k = π_0 P^k may also be expressed as the k-fold convolution of the transition measure. That is to say

$$\pi_k(y) = p^{*k}(y) = \sum_{x \in \Gamma} p(y x^{-1})\, p^{*(k-1)}(x).$$

At  any  representation  of  the  group,  the  Fourier  transform  has  the  property  that  the  transform 
of  a  convolution  of  two  functions  is  the  product  of  the  individual  transforms.  This  converts  the 
complicated  convolution  operation  on  the  group  to  a  matrix  multiplication.  Using  a  few  basic 
properties  of  representations,  one  gets  the  following  lemma. 

Lemma 2.26 (Upper Bound Lemma for Groups and Actions) [DS81, Dia88] If P is the
random walk on Γ with transition measure p then the k-th step distribution π_k of P satisfies

$$\|\pi_k - \pi\|^2 \le \frac{1}{4} \sum_{\rho} \deg\rho \;\mathrm{Tr}\!\left[(\hat p(\rho))^k \left((\hat p(\rho))^*\right)^k\right],$$

where p̂(ρ) represents the Fourier transform at the representation ρ and the sum is over all non-trivial
irreducible representations ρ. (Here * indicates the conjugate-transpose operation.)

If N is a subgroup of Γ and the transition measure p is N-bi-invariant, the kth-step distribution
π̃_k of the symmetrically induced walk P̃ on X = Γ/N satisfies

$$\|\tilde\pi_k - \tilde\pi\|^2 \le \frac{1}{4} \sum_{\rho} \deg\rho \;\mathrm{Tr}\!\left[(\hat p(\rho))^k \left((\hat p(\rho))^*\right)^k\right],$$

where the sum is over all non-trivial irreducible representations ρ that occur in L(X), the
representation given by the set of all complex-valued functions on X.




This is analogous to the bound of Theorem 1.9. The eigenvalues of p̂(ρ) are those of the Markov
transition matrix P, appearing with multiplicity given by the degree of ρ. For details see
Diaconis [Dia88, pp. 48-49]. Essentially the same relationship is discussed by Babai [Bab79] with respect
to the eigenvalues of the Cayley graph's adjacency matrix.

The repeated multiplication of the Fourier transform matrices p̂(ρ) is not necessarily any more
analyzable than the multiplication of the transition matrix of the Markov chain. But under certain
circumstances one can ensure that the matrices p̂(ρ) that arise have special structure that makes the
analysis simpler. Except for a few special cases (e.g. [BD89]), here is an essentially complete list of
the circumstances that Diaconis and others have been able to take advantage of.

•  Abelian groups. When Γ is abelian, all of its irreducible representations are 1-dimensional.
That is, the Fourier transform matrices p̂(ρ) involved are, in fact, (complex) scalars.

•  Transition measures constant on conjugacy classes. When the transition measure p is
constant on conjugacy classes the Fourier transforms at every irreducible representation are
almost-scalar; they are of the form cI where I is the identity matrix of order equal to the degree
of the representation. On specific groups in which the character theory is well-developed, this
can make the problem tractable.

•  Symmetrically induced walks when (Γ, N) forms a Gelfand pair. The group Γ and
subgroup N form a Gelfand pair when the convolution of N-bi-invariant functions on Γ
is commutative. Under these circumstances Diaconis and Shahshahani [DS87, Dia88] give
a beautiful treatment for walks P̃ on Γ/N induced symmetrically from an N-bi-invariant
transition measure p on Γ. They take advantage of the fact that, in this case, the Fourier
transform of p at any irreducible representation is, in some basis, a matrix with a single
nonzero entry in position (1, 1). Again, where the representation theory allows calculation of,
or at least bounds on, these (1, 1) entries, there is hope of getting good convergence bounds.
By Theorem 2.23, a similar analysis applies to directly induced walks on Γ/N when (Γ, N) is
a Gelfand pair.

When  these  techniques  apply,  and  a  tight  elementary  argument  is  not  apparent,  they  should 
be  used.  They  seem  to  give  tight  bounds  for  many  cases.  In  addition  there  may  be  other  reasons 
that  one  wants  to  do  the  harmonic  analysis,  since  there  are  other  applications  of  the  representation 
theory to the analysis of certain statistical data that can be viewed as data on groups [Dia88].

The  new  methods  given  in  the  remaining  sections  can  be  applied  when  the  harmonic  analysis 
does  not  prove  tractable. 

2.3.6  Magnification  Bounds 

In this section we present some new bounds on the magnification of some classes of underlying graphs
that arise from random walks on groups. These bounds use very little except basic group structure,
yielding results that are in terms of simpler quantities, "diameters." Such bounds are not usually
tight, but are simple to calculate and give reasonable results.

The basic structural properties of group actions allow us to extend a theorem of Aldous [Ald87] to
obtain the following generalization. The proof closely matches Aldous's proof. Recall the definition
of magnification from Section 1.5.2.


Theorem 2.27 (Magnification in Cayley Graphs) Suppose Γ acts transitively on X, and let
G = Cayley(Γ, X, S). Let Δ be any number such that every element of the group Γ can be written with
at most Δ generators from S. Then G has magnification

$$c \ge \frac{1}{2\Delta}.$$


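Proof: Let A ⊂ X with |A| ≤ |X|/2. Let δ(a, b) be the function that is 1 if a = b, 0 otherwise.
By Corollary 2.17

$$|\Gamma_{xy}| = |\mathrm{Stab}(x)| = \frac{|\Gamma|}{|X|}.$$

Therefore we have,

$$\sum_{g \in \Gamma} |gA \cap A| = \sum_{g \in \Gamma} \sum_{x \in A} \sum_{y \in A} \delta(gx, y) = \sum_{x \in A} \sum_{y \in A} |\Gamma_{xy}| = \frac{|A|^2 |\Gamma|}{|X|}.$$

Thus for some g in Γ, we must have |gA ∩ A| ≤ |A|²/|X| (the average), which implies that |gA − A| ≥ |A|/2.

Now write g = h_1 h_2 ··· h_d for some d ≤ Δ with each h_i ∈ S, and let g_0 = id, g_i =
h_1 h_2 ··· h_i for 1 ≤ i ≤ d. Since g_i A − g_{i−1} A = g_{i−1}(h_i A − A) and g_{i−1} acts as a
permutation of X, we have

$$|gA - A| \le \sum_{i=1}^{d} |g_i A - g_{i-1} A| = \sum_{i=1}^{d} |h_i A - A|.$$

So there must be an element h among the h_i such that

$$|hA - A| \ge \frac{|gA - A|}{\Delta} \ge \frac{|A|}{2\Delta}.$$

This, in turn, gives c ≥ 1/(2Δ). ∎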
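The averaging step in this proof is easy to test on a small action. The sketch below, assuming S_4 acting elementwise on the 2-subsets of {0, 1, 2, 3} and an arbitrarily chosen subset A of three vertices (both choices are illustrative), sums |gA ∩ A| over the whole group and compares the total with |A|²|Γ|/|X|.

```python
from itertools import combinations, permutations

n, k = 4, 2
G = list(permutations(range(n)))
X = [frozenset(c) for c in combinations(range(n), k)]
A = set(X[:3])                       # any A with |A| <= |X|/2

def act(g, x):
    return frozenset(g[i] for i in x)

# |gA ∩ A| for every group element g
overlaps = [len({act(g, x) for x in A} & A) for g in G]

total = sum(overlaps)
average = len(A) ** 2 / len(X)       # |A|^2 / |X|
best = min(overlaps)                 # some g does at least as well as the average

print(total, len(A) ** 2 * len(G) // len(X), best, average)
```

Since the minimum overlap cannot exceed the average, some g moves at least half of A outside A, which is the inequality |gA − A| ≥ |A|/2 used above.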


When the set X is taken to be Γ itself, and the action is given by the usual multiplication in Γ,
the graph G is simply the Cayley graph of the group under S, and Δ is the diameter of the graph G.
This is a special case for which Aldous proved the same eigenvalue bound. When X is not Γ, the
quantity Δ in the lower bound is not necessarily the diameter of the graph.

We have seen that Cayley graphs of groups are vertex-transitive. Babai [Bab90] has recently
proved the following theorem, which gives a generalization of Aldous's result in this direction. In this
the quantity Δ corresponds to the diameter. We adapt his proof only slightly here. Note that Cayley
graphs of group actions are not always vertex-transitive, so this does not completely supersede the
previous theorem.

Theorem 2.28 (Magnification in Vertex-Transitive Graphs) (From [Bab90]) Let G be a
vertex-transitive graph with diameter Δ. Then G has magnification

$$c \ge \frac{1}{2\Delta}.$$


Proof: Let G = (V, E) and let Γ = Aut(G) be the automorphism group of G. We know this acts
transitively on V. Fix a vertex v_0 ∈ V and let N = Stab(v_0) be the isotropy subgroup. For
each element v ∈ V let g_v be a representative element in the coset Γ_{v_0 v}.

We construct a certain Cayley graph of the group Γ and show that the earlier theorem applied
to this Cayley graph implies the result for G.

Let C be the graph with vertex set Γ in which we join two vertices g, h ∈ Γ if gv_0 = hv_0 or if
(gv_0, hv_0) is an edge in G.

The condition gv_0 = hv_0 is equivalent to the condition g^{-1}h ∈ N. Because g is an automorphism
of G, (gv_0, hv_0) is an edge of G iff (v_0, g^{-1}hv_0) is an edge of G. The latter is equivalent to the
condition g^{-1}h ∈ g_w N = Γ_{v_0 w} for some w adjacent to v_0. Using this one can see that C =
Cayley(Γ, Γ, S) where

$$S = \bigcup_{w = v_0 \text{ or } \{w, v_0\} \in E} g_w N.$$


The diameter of C is at most that of G, because if v_0, v_1, ..., v_k is any path between v_0 and
v_k in G, there is a corresponding path of the same length in C among the coset representatives
g_{v_0}, g_{v_1}, ..., g_{v_k}. (By vertex-transitivity it suffices to consider only paths from v_0.) Hence the
diameter of C is at most Δ. (It is not hard to see that the diameters are in fact equal.)

Let f : Γ → V be the projection that takes g to gv_0. We know by Lemma 2.16 that f
maps the same number |N| of elements to any vertex v. Also, it is easy to see that this projection
preserves neighborhoods in the sense that Nbd(f^{-1}(A)) = f^{-1}(Nbd(A)) for A ⊂ V. It follows
from Theorem 2.27 that

$$\frac{|\mathrm{Nbd}(A)|}{|A|} = \frac{|f^{-1}(\mathrm{Nbd}(A))|}{|f^{-1}(A)|} \ge \frac{1}{2\Delta}. \qquad ∎$$

Remark: Theorem 2.27 was obtained and presented in lectures in 1987, but is appearing in print
for the first time. Babai's Theorem 2.28 supersedes a result in Diaconis [DS89] for distance-transitive
graphs; all such graphs are necessarily also vertex-transitive. See [Big74] for definitions. Babai's
recent result was incorporated in the final draft. □


2.3.7  Eigenvalue Bounds for the Chains

Based  on  Theorem  1.17  we  can  now  give  eigenvalue  bounds  for  symmetric  walks  on  groups,  based 
on  the  magnification  bounds  given  in  the  preceding  section  for  their  underlying  graphs.  If  better 
bounds  are  known  for  the  magnification  of  the  underlying  graph,  one  should  appeal  directly  to 
Theorem  1.17. 

For the following two theorems, we assume that Γ is a finite group, p(·) is a symmetric transition
measure (p(g) = p(g^{-1})) whose support S = Supp(p(·)) generates Γ, P is the associated walk on Γ,
and Γ acts transitively on the set X with isotropy subgroup N.


Theorem 2.29 Let Δ be the diameter of Cayley(Γ, Γ, S). The second largest eigenvalue of P
satisfies

$$\lambda_1(P) \le 1 - \frac{p\,c^2}{4 + 2c^2},$$

where c = 1/(2Δ) and p = min_{s ∈ S−{id}} p(s).

Let Δ′ be the diameter of Cayley(Γ, X, S) if S is closed under conjugacy, and let Δ′ = Δ
otherwise. If P′ is the directly induced walk on X, then λ_1(P′) satisfies

$$\lambda_1(P') \le 1 - \frac{p'\,c'^2}{4 + 2c'^2},$$

with p′ = min_{g ∈ S−N} p(gN) and c′ = 1/(2Δ′).


Proof: For the chain P, the underlying graph is Cayley(Γ, Γ, S). For P′ the underlying graph is
Cayley(Γ, X, S). In each case the value p is the minimum non-holding transition probability. The
eigenvalue bounds are obtained by applying Theorem 1.17 together with Theorem 2.27.

We are justified in using the smaller value of Δ′ when Supp(p(·)) is closed under conjugacy,
because Theorem 2.21 ensures that the underlying graph is vertex-transitive. Thus we can apply
Theorem 1.17 with Theorem 2.28 instead of Theorem 2.27. ∎


Theorem 2.30 Let p be a symmetric N-bi-invariant measure with support S generating Γ, and let
P̃ be the symmetrically induced Markov chain on X. If Δ is the diameter of the underlying graph
of P̃ then

$$\lambda_1(\tilde P) \le 1 - \frac{p\,c^2}{4 + 2c^2},$$

where c = 1/(2Δ), and p = min_{g ∈ S−N} p(gN).




Proof: The underlying graph is not necessarily the Cayley graph of the group action, but by
the symmetry condition (2.4), it is necessarily vertex-transitive. Again, p is the minimum
non-holding transition probability and the eigenvalue bounds are obtained by applying Theorem 1.17
with Theorem 2.28. ∎

These  bounds  can  be  applied  directly  with  Theorem  1.8  to  get  convergence  bounds.  We  give 
some  examples  in  the  next  section. 

Remark:  A  similar  bound  holds  for  any  symmetric  ergodic  Markov  chain  whose  underlying  graph 
is  vertex-transitive  by  combining  Theorem  2.28  and  Theorem  1.17.  But  it  seems  that  the  most 
interesting  cases  are  covered  here.  □ 


2.3.8  Examples

We now show some examples and compare to tighter bounds where they are known. The convergence
rates here are all obtained by converting the chain P in question to a strongly aperiodic chain (I +
P)/2, and then applying the bounds of the previous section with Theorem 1.8. Note that converting
to the strongly aperiodic chain decreases the minimum transition probability by a multiplicative
factor of 1/2, slowing the convergence bound by approximately a factor of 2.

The  examples  seem  to  display  that  the  bounds  are  easily  applicable,  but  usually  not  very  tight. 

Example 2.31 Simple Random Walk on the Affine Group. Consider the random walk on the
affine group A_p when p is an odd prime and 2 is a primitive root modulo p. A_p has p(p − 1) elements
and the 9 elements {(a, b) | a ∈ {1, 2, (p + 1)/2}, b ∈ {0, 1, −1}} are generators with Δ = O(p). We get
an immediate O(p² ln p) convergence guarantee (for fixed positive ε < 1). Diaconis [Dia88, Example
4, pp. 34-35] uses a neat trick to make the harmonic analysis work, and gets a slightly weaker bound
that is O(c(p) p² ln p) steps, where c(p) is any function that grows to infinity with p. By considering
the projected walk obtained by tracking only the first coordinate a of each successive position (a, b)
in the walk, it is not hard to see that cp² steps, where c is fixed, is not sufficient. Hildebrand [Hil90]
has recently shown that O(c(p) p²) steps are sufficient, where again c(p) is any function growing to
infinity with p. □

Example  2.32  Random  Transpositions.  Consider  the  random  walk  on  the  symmetric  group 
generated by the measure p(id) = 1/n and $p(g) = 2/n^2$ if g is a transposition. Here we have
A  =  n—  1,  p  =  ^y,  and  an  immediate  eigenvalue  bound  of  approximately  1  —  8 .  In  actuality, 

$\lambda_1 = 1 - 2/n$. The transpositions form a single conjugacy class, so the transition measure is constant
on conjugacy classes. Using harmonic analysis and a full-spectrum bound, one finds that O(n ln n) steps
are sufficient to get a nearly uniform permutation, and this is tight [DS81] [Dia88]. Our
second-eigenvalue bound gives a much weaker $O(n^5 \ln n)$ bound, but is a one-line calculation. □




Example 2.33 Transposition and (n - 1)-cycle, Adjacent Transpositions. The symmetric
group $S_n$ is generated by a single transposition (1 2) and the (n - 1)-cycle (2 3 4 ··· n). Let p be
the uniform distribution on (1 2), (2 3 4 ··· n) and its inverse (n (n - 1) ··· 3 2). The diameter
satisfies $\Delta = O(n^2)$, and for the strongly aperiodic form we have p = 1/6. The resulting bounds are
$1 - \lambda_1 = \Omega(1/n^4)$, and convergence in $O(n^5 \ln n)$ steps. This is probably way off, but no better bounds
are known. The transition measure is not constant on conjugacy classes, and harmonic analysis
seems intractable. A clever coupling may work, but one is not evident. The walk generated by
adjacent transpositions (those of the form (i, i + 1)) has |S| = n - 1, $\Delta = O(n^2)$ (recall "Insertion
Sort"), and, thus, a similar $O(n^6 \ln n)$ bound. Again this seems off the mark, but provides the only
published bound. □
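The "Insertion Sort" remark can be made concrete. Insertion sort rearranges a permutation using only adjacent transpositions, and the number of swaps it performs equals the number of inversions, which is at most n(n - 1)/2. The sketch below counts those swaps; the function names are illustrative and not from the text.

```python
from itertools import permutations


def adjacent_swaps_to_sort(perm):
    """Sort a copy of perm by insertion sort and count the adjacent swaps used."""
    a = list(perm)
    swaps = 0
    for i in range(1, len(a)):
        j = i
        # Bubble a[i] leftwards one adjacent transposition at a time.
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            swaps += 1
            j -= 1
    return swaps


def worst_case_swaps(n):
    """Largest swap count over all permutations of n items (brute force, small n only)."""
    return max(adjacent_swaps_to_sort(p) for p in permutations(range(n)))
```

For small n the brute-force maximum agrees with n(n - 1)/2, attained by the reversed permutation, which is the quadratic growth of the diameter used above.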

Example 2.34 Two-Urn Bernoulli-Laplace. Consider the two-urn Bernoulli-Laplace model
with k balls in the left urn and n - k in the right urn. From our earlier examples, this can be
viewed as a chain induced symmetrically from an N-bi-invariant transition measure. Without loss
of generality assume $k \le n/2$. We have $\Delta = k$ and $p = 1/(2k(n - k))$, and an immediate eigenvalue bound of
$\lambda_1 \le 1 - 1/(32k^3(n - k) + 4k(n - k))$. This is, again, far off the mark. As mentioned earlier,
Diaconis and Shahshahani [DS87] get a tight full-spectrum convergence bound and find that the
second eigenvalue is exactly $\lambda_1 = 1 - n/(k(n - k))$. We can match their convergence bounds by our
earlier couplings for this case. See Theorem 2.10. □

Example  2.35  Many-Urn  Bernoulli-Laplace.  For  three  or  more  urns,  the  Bernoulli-Laplace 
process  can  still  be  induced  symmetrically  from  an  N-bi-invariant  transition  measure,  but  (G,  N) 
is  not  a  Gelfand  pair.  The  harmonic  analysis  seems  intractable.  We  can  still  apply  Theorem  2.30. 
For n = 3k balls in m = 3 urns, and the strongly-aperiodic form, the diameter of the underlying
graph is 2k and all non-holding transition probabilities are equal to $p = 1/(4k^2)$. This gives a bound
of  Ar  <  1  -  ot  *bout  1  -  STT'+n7'  We  do  better  with  the  couplins  bound  of Theorem  2.11. 

□ 

Example 2.36 Near-Uniform generation in groups with polynomial diameter. Theorem 2.29
implies that if Γ is a finite group presented by n generators, with a diameter guaranteed
bounded by a polynomial p(n), then there is a near-uniform generator for Γ that runs in time
polynomial in n and the cost of multiplication in the group. Surprisingly, these ideas can be extended
to certain infinite cases. See [Bab90]. □

Remark: "Reasonable" shuffles. Intuitively, any "reasonable" set of generators of $S_n$ has a
Cayley graph with diameter polynomial in n, and the walk on such a graph will therefore have a
polynomial convergence guarantee. A good project would be to make these ideas precise. It is known
that with probability 3/4 two randomly chosen elements of $S_n$ will generate the group. If one can
get a handle on the distribution of the diameter under two generators, it would be possible to prove
a bound using these theorems. It is hard to conceive of two generators which yield a diameter that
is not $O(n^2)$. In a related direction Babai, Kantor, and Lubotzky [BKL89] prove that there exists a
constant C such that every non-Abelian finite simple group Γ has a set of generators S, with $|S| \le 7$,
for which the diameter of Cayley(Γ, S) is at most $C \ln |\Gamma|$. □


Chapter  3 

Mean-Value Estimation


Often  one  wants  to  determine  the  mean  value 


$$h_1 = \sum_{v \in V} h(v)\,\pi(v) \tag{3.1}$$

of a real-valued function h under some distribution π on a finite set V. If V is very large, or h is
sufficiently complex to analyze, computing this value exactly may be prohibitively expensive. In
this case one often resorts to estimates obtained by sampling.

If one is able to draw independent samples according to π, one generally makes an estimate using
the sample mean

$$A_n = \frac{1}{n} \sum_{i=1}^{n} h(Y_i), \tag{3.2}$$

where the random variables $Y_i$ are the samples. This is an unbiased estimator, which means that

$$E[A_n] = h_1, \tag{3.3}$$

and its variance is

$$\mathrm{Var}[A_n] = \frac{h_2}{n}, \tag{3.4}$$

where

$$h_2 = \sum_{v \in V} \pi(v)\,(h(v) - h_1)^2 \tag{3.5}$$

is the variance of h(Y) when Y is a single element drawn according to π. If bounds on $h_2$ are known,
Chebysheff's inequality may then be used to bound the probability that the estimate lies outside a
given interval around $h_1$. If $n \ge 4h_2/\beta^2$ then the estimate $A_n$ satisfies

$$\Pr\{|A_n - h_1| > \beta\} \le \frac{1}{4}.$$

The  following  well-known  lemma  can  then  be  used  to  rapidly  decrease  the  probability  of  error. 
(This  has  been  called  the  ‘Powering  Lemma’  in  [JVV86].) 




Lemma 3.1 (Median Lemma) Let $a_i$ for $1 \le i \le m$ be m random estimates of a such that

$$\Pr\{|a_i - a| > \beta\} \le \frac{1}{4}$$

for each $a_i$ independent of any previous estimates. Let M be the median of the $a_i$. (If m is even,
take the average of the two candidate medians.) Then

$$\Pr\{|M - a| > \beta\} \le 2^{-m}.$$

Note: The constant 1/4 may be replaced by any constant c < 1, if $2^{-m}$ is replaced by $c^{m/2}$.

Proof: Since M is the median of the estimates $a_i$, at least half ($\lceil m/2 \rceil$) of the $a_i$ are at least M
and at least half are at most M. Thus if M lies outside the interval $[a - \beta, a + \beta]$, at least $\lceil m/2 \rceil$ of
the estimates $a_i$ do. Applying the independence condition with the law of conditional probability,
we have $\Pr\{|M - a| > \beta\} \le (1/4)^{\lceil m/2 \rceil} \le 2^{-m}$. □

Thereby $\lceil 4h_2/\beta^2 \rceil \lceil \lg(1/\delta) \rceil$ independent samples are sufficient to get an estimate M of the mean value
that is within β of $h_1$ with probability at least $1 - \delta$. This method of estimating the mean will be
called the median-of-sample-means algorithm. Of substantial special interest is the case when
h takes values in {0, 1} and π is uniform on V. If $H = \{v \mid h(v) = 1\}$, then h is called the indicator
function of the set H. The mean value of h is then

$$h_1 = p = \frac{|H|}{|V|}.$$

Typically, this situation comes up in the problem of approximate counting, when one knows |V| and
wants to estimate |H|, or vice versa. In this case $nA_n$ is binomially distributed with parameters n
and p. Chernoff-type bounds on the tails of the binomial distribution may be used to get very good
bounds on the probability of given error.
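The median-of-sample-means algorithm described above can be sketched as follows. This is a minimal illustration under stated assumptions: `draw` is a hypothetical caller-supplied function returning one independent sample from π, `h2_bound` is an a priori upper bound on $h_2$, and the number of repetitions is taken as $\lceil \lg(1/\delta) \rceil$ following the count given in the text. The sketch does not itself verify the resulting confidence level.

```python
import math
import random
import statistics


def chebyshev_sample_size(h2_bound, beta):
    """Samples per estimate so that Chebysheff's inequality gives error probability <= 1/4."""
    return math.ceil(4.0 * h2_bound / beta ** 2)


def repetitions(delta):
    """Number of independent sample means whose median is taken."""
    return max(1, math.ceil(math.log2(1.0 / delta)))


def median_of_sample_means(draw, h, h2_bound, beta, delta, rng):
    n = chebyshev_sample_size(h2_bound, beta)
    m = repetitions(delta)
    means = [sum(h(draw(rng)) for _ in range(n)) / n for _ in range(m)]
    # statistics.median averages the two middle values when m is even.
    return statistics.median(means)


# Illustrative use: estimate the fraction of even numbers in {0, ..., 999}.
rng = random.Random(0)
estimate = median_of_sample_means(
    draw=lambda g: g.randrange(1000),
    h=lambda v: 1 if v % 2 == 0 else 0,
    h2_bound=0.25, beta=0.1, delta=0.01, rng=rng)
```

For an indicator function under the uniform distribution, `h2_bound = 0.25` is always a valid choice, since p(1 - p) is at most 1/4.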

These  are  basic  results  for  mean-value  estimation  when  a  source  of  independent  and  identically 
distributed  (i.i.d.)  samples  is  readily  available.  The  purpose  of  this  chapter  is  to  provide  some 
analogous  analysis  of  the  sample  mean  and  the  median  of  many  sample  means  when  the  samples  are 
drawn from a time-reversible Markov chain with stationary distribution π on V. The motivation
for  this  investigation  is  the  Markov  chain  simulation  method  for  sampling.  The  basic  method  is 
to  run  a  Markov  chain  on  V  with  the  desired  stationary  distribution,  running  the  chain  until  its 
distribution  is  stationary  or  nearly  so,  and  then  drawing  samples  from  the  stationary  chain.  This 
chapter  lays  the  groundwork  for  the  next  chapters,  and  provides  the  basic  tools  for  the  empirical 
work  in  Chapter  5.  Here  is  an  outline  of  the  chapter. 

Let $\{X_k \mid k \ge 0\}$ be a Markov chain with transition matrix P. To avoid technicalities, we
assume $X_0$ is drawn according to π, so that the chain evolves in the stationary distribution. All
of the bounds here hold approximately when $X_0$ is chosen from a distribution close to π. We will
consider the estimator

$$C_n^t = \frac{1}{n} \sum_{0 \le i < n} h(X_{ti}), \tag{3.6}$$

which is the sample mean when the samples are drawn every t steps in the chain. We will analyse
this estimator and estimates based on the median of these sample means.
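As a sketch of how the estimator in (3.6) is computed in a simulation, the function below averages h over n states taken every t steps. The names `spaced_sample_mean` and `step` are illustrative assumptions: `step(x, rng)` stands for one transition of whatever chain is being simulated, and the starting state `x0` is assumed to have been drawn from (or near) the stationary distribution.

```python
import random


def spaced_sample_mean(step, x0, h, n, t, rng):
    """Average of h over X_0, X_t, ..., X_{(n-1)t} for the chain driven by step."""
    x = x0
    total = h(x)
    for _ in range(n - 1):
        for _ in range(t):        # advance t steps between recorded samples
            x = step(x, rng)
        total += h(x)
    return total / n


def lazy_cycle_step(x, rng, m=12):
    """One step of a lazy walk on the cycle Z_m (an illustrative chain only)."""
    return (x + rng.choice((-1, 0, 0, 1))) % m


rng = random.Random(1)
x0 = rng.randrange(12)            # uniform start, which is stationary for this walk
estimate = spaced_sample_mean(lazy_cycle_step, x0, lambda x: 1 if x < 3 else 0,
                              n=2000, t=3, rng=rng)
```

The run uses t(n - 1) chain steps and n evaluations of h, which is the cost split that matters later in the chapter.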

The following is an easy consequence of the approximation properties of nearly-independent
samples discussed in Appendix B.

Theorem 3.2 Let P be an ergodic Markov chain with stationary distribution π and convergence
guarantee $T_P(\epsilon)$. Let $X_0$ be drawn from any distribution $\pi_0$ within ratio $1 + \epsilon$ of π and suppose
$t \ge T_P(\epsilon/2n)$. Then the distribution of the resulting estimator $C_n^t$ is within ratio $(1 + \epsilon)$ of the
distribution of the sample mean $A_n$ based on n i.i.d. samples.

Proof: This follows immediately from Theorem B.1 and Theorem B.2. □

However, in this chapter we will investigate the performance of the estimator when samples are
drawn t steps apart from the stationary (or nearly stationary) chain, where t may be much smaller
than the time to stationarity.

For any t, including the case t = 1, linearity of expectation still gives

$$E[C_n^t] = h_1. \tag{3.7}$$

That is, the estimator is unbiased. However, the samples will in general be correlated, and the
variance will involve covariance terms.

In Section 3.1 we analyse these covariance terms in terms of the spectrum of the chain. Making
a minor extension to a theorem of Aldous, we obtain tight worst-case bounds for the variance of the
estimator $C_n^t$ in terms of n, t, and $\lambda_1$, the second largest eigenvalue of the chain.

The variance bounds can be used with Chebysheff's inequality to bound the probability that the
estimate lies outside a given interval around $h_1$. Then the traditional median trick in Lemma 3.1 can be
applied to obtain accurate estimates with high probability.

We can also compare the variance of sample means based on truly independent samples with that of
samples obtained from the stationary Markov chain at various time intervals. The following tradeoff
is investigated. Taking t as large as the time to stationarity gives very 'high quality' samples that
are very nearly independent, and the accuracy of $C_n^t$ will nearly match that of independent samples.
However, taking t large also means we are expending computational effort to simulate many steps of
the chain between samples, but ignoring the 'information' in the intervening states. In particular,
when only a $\lambda_1$-based bound on convergence is known, it turns out that it is usually better, from a
computational standpoint, to take t small and use this 'information' despite the added correlation
effects.




In Section 3.3 we consider the special case when h takes values in {0, 1} and π is uniform. For
large t, $C_n^t$ is approximately binomially distributed, and Chernoff bounds are available for the sample
mean.  We  discuss  the  relative  merits  of  close  and  well-spaced  samples  when  such  tight  tail  bounds 
are  available,  and  still  conclude  that  using  samples  separated  by  only  a  few  steps  is  typically  more 
efficient  than  taking  nearly  independent,  well-spaced  samples. 

3.1  Variance  Bounds 

In this section we give tight worst-case bounds on the variance of the estimator $C_n^t$ defined in (3.6),
when $X_0$ is drawn from the stationary distribution. In this case the chain continues to evolve in the
stationary distribution, and as mentioned in (3.7), $C_n^t$ is an unbiased estimator.

We  will  be  using  the  notation  Af,  J£,  B}  and  T  and  2,  introduced  in  Section  1.2  for  the  spectral 
representation  of  the  reversible  chain  P.  The  uncertain  reader  should  refer  back  to  that  section  for 
definitions.  We  will  not  re-define  the  notation  here. 

Our  analysis  closely  follows  the  line  of  Aldous  [AM87],  but  we  retain  some  lower  order  terms  in 
order  to  obtain  a  closely  matching  worst-case  lower  bound. 

We will assume, without loss of generality, that $h_1 = 0$; this assumption does not affect the
analysis, but leaves $h_2 = \sum_v \pi(v)\,h^2(v)$, which is simpler to handle.

First write

$$\mathrm{Var}[C_n^t] = \frac{1}{n^2} \left[ n h_2 + \sum_{k=1}^{n-1} 2(n - k)\, E_{tk} \right], \tag{3.8}$$

where

$$E_k = E[h(X_0)\,h(X_k)].$$

This is just the standard expansion for the variance of a sum of random variables; the first term is the
sum of the variances, which may be interpreted as the "independent" component, and the second
term is the sum of the covariance terms. To get the form above we have noted that the covariance
$E[h(X_j)\,h(X_{j+k})]$ between any pair of values $h(X_j)$, $h(X_{j+k})$ for $0 \le j < n - k$ is the same $E_k$.
A given covariance term $E_{tk}$ appears 2(n - k) times, once for each ordered pair of samples at separation k
(twice for each unordered pair). We want to bound this sum of covariance terms.
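The expansion can be checked numerically on a small chain. The sketch below computes the covariances $E_k = \sum_{x,y} h(x)\,\pi(x)\,h(y)\,P^k_{xy}$ exactly from matrix powers and assembles the variance as n times $h_2$ plus the weighted covariance sum, divided by $n^2$. It assumes h has already been centred so that its mean under π is zero. The function names and the two-state example are illustrative choices, not from the text.

```python
def mat_mul(A, B):
    size = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(size)) for j in range(size)]
            for i in range(size)]


def covariances(P, pi, h, kmax):
    """Return [E_0, ..., E_kmax] for a centred h; E_0 equals h_2."""
    size = len(P)
    Pk = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    out = []
    for _ in range(kmax + 1):
        out.append(sum(h[x] * pi[x] * h[y] * Pk[x][y]
                       for x in range(size) for y in range(size)))
        Pk = mat_mul(Pk, P)
    return out


def variance_of_spaced_mean(P, pi, h, n, t):
    """Exact variance of the mean of h over n stationary samples taken t steps apart."""
    E = covariances(P, pi, h, t * (n - 1))
    return (n * E[0] + sum(2 * (n - k) * E[t * k] for k in range(1, n))) / n ** 2


# Two-state symmetric chain with second eigenvalue 1/2; h = (+1, -1) is centred.
P = [[0.75, 0.25], [0.25, 0.75]]
pi = [0.5, 0.5]
h = [1.0, -1.0]
var_successive = variance_of_spaced_mean(P, pi, h, n=3, t=1)
```

For this chain $E_k = (1/2)^k$, so with n = 3 and t = 1 the variance is (3 + 2 + 1/2)/9, noticeably larger than the value 1/3 that three independent samples would give.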

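Since $X_0$ is chosen according to π we get, for any k,

$$E_k = \sum_{x \in V} \sum_{y \in V} h(x)\,\pi(x)\,h(y)\,P^k_{xy}. \tag{3.9}$$

Apply Theorem 1.3 to rewrite $P^k_{xy}$ in its spectral representation, and get

$$E_k = \sum_{x,y} h(x)\,\pi(x)\,h(y) \sqrt{\frac{\pi(y)}{\pi(x)}} \sum_w \lambda_w^k\, \Gamma_{xw} \Gamma_{yw} \tag{3.10}$$

$$= \sum_w \lambda_w^k \Big( \sum_x h(x) \sqrt{\pi(x)}\, \Gamma_{xw} \Big) \Big( \sum_y \sqrt{\pi(y)}\, h(y)\, \Gamma_{yw} \Big) \tag{3.11}$$

$$= \sum_w \lambda_w^k \Big( \sum_x h(x) \sqrt{\pi(x)}\, \Gamma_{xw} \Big)^2 \tag{3.12}$$

$$= \sum_w \lambda_w^k\, v_w^2. \tag{3.13}$$

In moving between the last two lines we have simply called the inner sum $v_w$; this $v_w$ is just the
projection of the vector with coordinates $h(x)\sqrt{\pi(x)}$ on the basis vector $\Gamma_{\cdot w}$ in the orthonormal
basis of eigenvectors of M. Since $\Gamma_{x0} = \sqrt{\pi(x)}$ we have

$$v_0 = \sum_x h(x)\,\pi(x) = h_1 = 0 \quad\text{and}\quad \sum_w v_w^2 = \sum_x \pi(x)\,h^2(x) = h_2. \tag{3.14}$$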

Now one is tempted to bound

$$|E_k| \le (\lambda_*)^k\, h_2. \tag{3.15}$$

This is valid. However, this will yield a bound on the variance of $C_n^t$ in terms of $\lambda_*$ and thereby lose
something. Using an idea of Aldous, we resist this temptation, and obtain a bound in terms of $\lambda_1$
only, but only for odd t. The difference is discussed in the remark following the theorem.


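Theorem 3.3 (Tight Variance Bounds) Let $X_0, X_t, \ldots, X_{(n-1)t}$ be n states drawn t steps apart
from a stationary time-reversible Markov chain P with state space V, stationary distribution π, and
second-largest eigenvalue $\lambda_1$. Let h be any real-valued function on V, and define

$$h_1 = \sum_{v \in V} \pi(v)\,h(v) \quad\text{and}\quad h_2 = \sum_{v \in V} \pi(v)\,(h(v) - h_1)^2.$$

If t is odd and positive, the mean-value estimator $C_n^t = \frac{1}{n} \sum_{i=0}^{n-1} h(X_{ti})$ has expected value

$$E[C_n^t] = h_1$$

and variance

$$\mathrm{Var}[C_n^t] \le \alpha(r(t), n)\,h_2,$$

where

$$r(t) = \frac{1}{1 - \lambda_1^t}$$

and

$$\alpha(r, n) = \frac{2r - 1}{n} - \frac{2r(r - 1)}{n^2} \left( 1 - \Big(1 - \frac{1}{r}\Big)^n \right).$$

Moreover, there exists a function h such that for every t and n,

$$\mathrm{Var}[C_n^t] \ge \alpha(r(t), n)\,h_2.$$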
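A short numerical sketch of the quantities in the theorem follows. It takes r(t) = 1/(1 - λ_1^t) and the closed form α(r, n) = (2r - 1)/n - (2r(r - 1)/n^2)(1 - (1 - 1/r)^n), which is the expression the proof's closed-form evaluation arrives at, and compares it with the direct sum n + 2 Σ_{k=1}^{n-1} (n - k) λ_1^{tk} divided by n^2. The function names are illustrative.

```python
def r_of_t(lam1, t):
    """r(t) = 1 / (1 - lam1**t)."""
    return 1.0 / (1.0 - lam1 ** t)


def alpha(r, n):
    """Closed-form variance factor alpha(r, n)."""
    return (2 * r - 1) / n - (2 * r * (r - 1) / n ** 2) * (1 - (1 - 1 / r) ** n)


def alpha_by_direct_sum(lam1, t, n):
    """The same factor computed as (n + 2 * sum_{k=1}^{n-1} (n - k) * lam1**(t*k)) / n**2."""
    x = lam1 ** t
    return (n + 2 * sum((n - k) * x ** k for k in range(1, n))) / n ** 2


# The two forms agree up to rounding, e.g. for lam1 = 0.9, t = 3, n = 50.
gap = abs(alpha(r_of_t(0.9, 3), 50) - alpha_by_direct_sum(0.9, 3, 50))
```

When r is close to 1 (samples far apart) the factor is close to 1/n, the value for independent samples. When n is much larger than r it behaves like (2r - 1)/n.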



Proof  of  the  Upper  Bound 

The expected value is $h_1$ by linearity. We complete the analysis of the variance from the preceding
discussion.

Consider now the sum term from (3.8). In light of (3.13) it is:

$$\sum_{k=1}^{n-1} 2(n - k)\, E_{tk} = \sum_{k=1}^{n-1} 2(n - k) \sum_w \lambda_w^{tk}\, v_w^2 = \sum_{w \ge 1} v_w^2 \sum_{k=1}^{n-1} 2(n - k)\, \lambda_w^{tk},$$

since $v_0 = 0$.

Now the maximum of each inner sum is attained at $\lambda_1$. This is because for λ < 0, and t odd,
the inner sum is an alternating sum, and it is bounded above by zero. When λ is positive, however,
it increases monotonically with λ. So it attains its maximum at $\lambda = \lambda_1$. From this and (3.14), we
have

$$\mathrm{Var}[C_n^t] \le \frac{\kappa}{n^2}\, h_2, \tag{3.16}$$

where

$$\kappa = n + 2 \sum_{k=1}^{n-1} (n - k)\, \lambda_1^{tk} = 2 \left( \sum_{k=0}^{n-1} (n - k)\, \lambda_1^{tk} \right) - n.$$

Putting the sum in closed form, letting $r = r(t) = 1/(1 - \lambda_1^t)$, and simplifying, gives

$$\kappa = \frac{2}{(1 - \lambda_1^t)^2} \left( \lambda_1^{t(n+1)} - (n + 1)\lambda_1^t + n \right) - n$$

$$= 2r^2 \left[ (1 - 1/r)^{n+1} + n/r - (1 - 1/r) \right] - n$$

$$= 2r^2 \left[ (1 - 1/r)^{n+1} + n/r - (1 - 1/r) - n/(2r^2) \right]$$

$$= (2r - 1)\,n - 2r(r - 1) \left( 1 - (1 - 1/r)^n \right),$$

so that

$$\frac{\kappa}{n^2} = \alpha(r, n).$$

Then (3.16) gives the desired bound. □


Proof  of  the  Lower  Bound 


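First, we show how to pick an h such that $E_{tk} = \lambda_1^{tk} h_2$, for every k. Then the result will follow
easily.

Let $\Gamma_1$ be an eigenvector of M corresponding to $\lambda_1$. We use $\Gamma_{x1}$ to denote its xth coordinate.
Take the function h so that $h(x) = \Gamma_{x1}/\sqrt{\pi(x)}$. Then, by the orthonormality of
the matrix Γ, we have

$$h_1 = \sum_x \pi(x)\,h(x) = \sum_x \sqrt{\pi(x)}\, \Gamma_{x1} = \sum_x \Gamma_{x0} \Gamma_{x1} = 0$$

and

$$h_2 = \sum_x \pi(x)\,h^2(x) = \sum_x \pi(x) \frac{(\Gamma_{x1})^2}{\pi(x)} = \sum_x (\Gamma_{x1})^2 = 1.$$

By the same orthogonality, this function h satisfies

$$\sum_x h(x) \sqrt{\pi(x)}\, \Gamma_{xw} = \begin{cases} 1 & \text{if } w = 1, \\ 0 & \text{otherwise.} \end{cases}$$

Using this in (3.12) gives

$$E_{tk} = \lambda_1^{tk}.$$

Consequently, by (3.8) this h will attain

$$\mathrm{Var}[C_n^t] = \frac{\kappa}{n^2} = \alpha(r, n)\,h_2,$$

and the result follows. □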

Remark: If we had used $|E_k| \le \lambda_*^k h_2$ as in (3.15), we could have avoided the maximisation
argument in the proof of the upper bound and obtained an upper bound valid for both odd and
even t. However, this would have lost something (since $\lambda_1 \le \lambda_*$) that one cannot recover in a lower
bound that is valid for all t. While a lower bound involving $\lambda_*$ is possible for even values of t, our
lower bound is valid for both odd and even t.

There is a good reason to write the bound in terms of $\lambda_1$ rather than $\lambda_*$. Current techniques for
bounding the rate to stationarity are based on $\lambda_*$, while most techniques for bounding eigenvalues
can bound only $\lambda_1$. Typically this problem is avoided by converting to a related strongly-aperiodic
chain, such as (I + P)/2, whose eigenvalues will all be nonnegative. However, this type of conversion
may result in a slowdown of a factor of 2. One consequence of this theorem is that when using a
chain for estimation we need not make this conversion. □

Remark: In the case that V is an Abelian group and P a random walk on the group, $v_w$ in
(3.13) corresponds to the Fourier coefficient of h at the representation associated with w, and $B_{ww}$
corresponds to the Fourier coefficient of the transition measure at that representation. On groups
such as the hypercube ($Z_2^d$) or the circle ($Z_n$), where the harmonic analysis is quite tractable and all
of the eigenvalues and eigenvectors can be easily written, it may be possible to get better bounds for
specific h. For example, if h is such that $v_w$ is known to be small for the large eigenvalues, then
a better bound results following essentially the same line of analysis as for Theorem 3.3. However,
a natural situation in which such functions h arise is not apparent. □


We  will  be  interested  primarily  in  the  upper  bound,  which  is  convenient  to  apply  in  the  following 
form. 

Corollary 3.4 With the same setup as in the preceding theorem, consider the estimator $C_n^t$, taking

$$n = \lceil (2r(t) - 1)k \rceil$$

samples drawn t steps apart, with t odd. Then for every function h and every $k \ge 1$

$$\mathrm{Var}[C_n^t] \le \frac{h_2}{k}.$$

Proof: Apply Theorem 3.3 with n = (2r - 1)k. Using the facts that $r \ge 1$ and $(1 - 1/r)^n \le e^{-n/r} \le 1/2$
for the given n, we get

$$\alpha(r, n) = \frac{2r - 1}{n} - \frac{2r(r - 1)}{n^2} \left( 1 - (1 - 1/r)^n \right)$$

$$\le \frac{2r - 1}{n} - \frac{r(r - 1)}{n^2}$$

$$\le \frac{1}{k} - \frac{r(r - 1)}{(2r - 1)^2 k^2}$$

$$\le \frac{1}{k}.$$

Note: for the typical case when r > 1, the inequality is, in fact, strict. □


The interpretation of the corollary above is that taking $\lceil (2r(t) - 1)k \rceil$ samples drawn t steps
apart  from  the  stationary  Markov  chain  gives  an  estimator  whose  variance  is  guaranteed  to  be  at 
most  that  of  the  sample  mean  based  on  k  independent  samples. 

Applying Chebysheff's inequality with the preceding corollary immediately gives the following
confidence  interval  bound. 


Corollary 3.5 (Chebysheff Bounds) For $n \ge (2r(t) - 1)k$, and t odd, we can guarantee

$$\Pr\{|C_n^t - h_1| > \beta\} \le \frac{h_2}{k \beta^2}.$$

In particular, if $n \ge (8r(t) - 4)\,h_2/\beta^2$, then

$$\Pr\{|C_n^t - h_1| > \beta\} \le \frac{1}{4}.$$
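The sample counts in the two corollaries are simple to compute once a value (or upper bound) for $\lambda_1$ is in hand. The sketch below is illustrative only: the function names are assumptions, `lam1` stands for the second-largest eigenvalue or an upper bound on it, and the closed form used for α is the one obtained in the proof of Theorem 3.3.

```python
import math


def r_of_t(lam1, t):
    return 1.0 / (1.0 - lam1 ** t)


def alpha(r, n):
    return (2 * r - 1) / n - (2 * r * (r - 1) / n ** 2) * (1 - (1 - 1 / r) ** n)


def samples_for_variance(lam1, t, k):
    """Corollary 3.4: samples at odd spacing t matching the variance of k independent samples."""
    return math.ceil((2 * r_of_t(lam1, t) - 1) * k)


def samples_for_chebysheff(lam1, t, h2_bound, beta):
    """Corollary 3.5: samples at odd spacing t so that Pr{|C - h_1| > beta} <= 1/4."""
    return math.ceil((8 * r_of_t(lam1, t) - 4) * h2_bound / beta ** 2)


n = samples_for_variance(0.5, 1, 10)       # r(1) = 2, so n = 30
within_bound = alpha(r_of_t(0.5, 1), n) <= 1.0 / 10
```

Because of floating-point rounding in r(t), the ceiling can occasionally come out one sample larger than the exact value, which only makes the guarantee more conservative.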

Repeated estimates $C_n^t$ can be combined using the median trick of Lemma 3.1, provided that
each estimate satisfies the bound of Corollary 3.5 independent of the result of preceding estimates.
This can be guaranteed by drawing the initial sampling point $X_0$ for each estimate independently,
or even nearly independently. This is discussed further in Appendix B. (In Chapter 4 we even show
a case in which one can use repeated estimates when the points $X_0$ are highly correlated, but still
have certain "sample majority" properties resembling those of independent samples.)

Remark: One interpretation of Corollary 3.5 is that the estimator $C_n^t$ based on $\lceil (2r(t) - 1)k \rceil$
samples is essentially as good as the sample mean based on k pairwise-independent π-distributed
samples.

To apply our bounds in practice we rely on an upper bound on $h_2$. If h is a member of a class
of functions where $h_2$ can be bounded a priori, our bounds may be applied directly. For example,
when h is the indicator function of a subset $H \subseteq V$, and π is the uniform distribution, we have
$h_2 \le 1/4$. In general, however, one may have to resort to estimates of $h_2$ as well. Various estimators
for $h_2$ for general stationary regenerative processes have been suggested [Han57] [Igl78] [GI84].
Glynn and Iglehart [GI87] give a joint central limit theorem for a mean and variance estimator, and
Calvin [Cal90] provides some additional analysis but not any non-asymptotic bounds. □

3.2  Sampling  to  Achieve  Given  Variance 

The  point  of  this  section  is  to  show  how  the  bounds  of  the  previous  sections  can  be  applied  to  answer 
a  basic  question  that  arises  when  computing  mean  values  by  the  Markov  chain  simulation  method. 
Namely,  we  discuss  the  question  ‘How  large  should  one  choose  the  spacing  t  between  samples  in 
order  to  obtain  estimates  with  a  given  guaranteed  variance  bound?’  The  answer  to  this  question 
will  also  answer  similar  questions  which  arise  when  applying  the  median-of-sample-means  method 
with  Chebysheff  bounds. 

Suppose that we want to estimate $h_1$ with variance about $h_2/k$ (the variance of the sample
mean based on k independent samples). For the moment, let us restrict our discussion to the
typical case. Suppose we have found a bound on $\lambda_1$. After conversion to the aperiodic chain
$P' = (I + P)/2$ one gets a convergence guarantee for P' whose leading term is about
$T = 2r(1) \ln(1/\pi_{\min}) \ge 2r(1) \ln |V|$, where r(t) is that of the original chain P.




By taking a small t, we use much of the 'information' in the states that we pass through. However,
we  need  more  samples  to  achieve  a  given  variance,  and  each  time  we  take  a  sample  we  must  compute 
the  function  h.  By  taking  a  large  t,  we  get  nearer  to  independent  samples  and  reduce  the  number 
of  samples  required.  We  compute  the  function  h  fewer  times.  In  tradeoff  we  spend  time  to  simulate 
many  steps  of  the  chain  between  samples. 

For example, suppose we take $t \ge T$. These samples are roughly independent, and perform about
as well in variance. (This is by Theorems B.1 and B.2, or by Corollary 3.4 since $r(t) \approx 1$ for $t \ge T$.)
This will mean that we compute h only about k times to get variance $h_2/k$, but that we run the
chain on the order of kT steps.

Alternatively, we might take t = 1, so that we are taking successive samples. Then n =
(2r(1) - 1)k samples will be enough to guarantee a variance at most $h_2/k$. So h will be computed
n = (2r(1) - 1)k times. But on the other hand, we will only need to run the chain for
T + (k - 1)(2r(1) - 1) steps (T to get the initial sample $X_0$ and (k - 1)(2r(1) - 1) to get the
remaining samples). This is much smaller than kT if the state space V is large. Thus, if the cost of
computing h is negligible compared to the cost of taking a step of the Markov chain, taking successive
samples will be cheaper than taking roughly-independent samples to achieve a given variance.

We should also consider the possibility that h may be calculated more easily incrementally than
from scratch. In other words, it may be that computing $h(X_i)$ from scratch requires significant
resources, while it may be easy to compute $h(X_i)$ when we are given $h(X_{i-1})$, and some information
about the transition that took $X_{i-1}$ to $X_i$. For example, suppose $v = (v_1, v_2, \ldots, v_n)$ is an integer
vector in $[m]^n$, and that $h(v) = \sum_{j=1}^{n} v_j^2$. This requires time linear in n to compute in general.

Suppose, however, that we sample using a chain generated by moves of the following type: if $X_i =
(v_1, v_2, \ldots, v_n)$ is the current position, then uniformly choose a random step direction $\sigma \in \{+1, -1\}$,
and a random coordinate $j \in [n]$; then, if $(v_j + \sigma) \in [m]$, move by adding σ to $v_j$; otherwise, remain
at the current position. (This move generates a symmetric ergodic chain. Spectral analysis of this
type of chain is in Chapter 2.) For this chain we can compute $h(X_i)$ in constant time from $h(X_{i-1})$
and the knowledge of the move (σ, j) that took $X_{i-1}$ to $X_i$, because $h(X_i) = h(X_{i-1}) + 2\sigma v_j + 1$,
where $v_j$ is the value of the coordinate before the move. A similar
trick applies to the computation of the $\chi^2$ statistic on tables that we investigate in Chapter 5.
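The incremental update can be sketched directly. The code below assumes [m] means {1, ..., m} and takes h(v) to be the sum of squared coordinates, which is the function whose increment is 2σv_j + 1. The name `step_with_increment` is illustrative.

```python
import random


def step_with_increment(v, h_val, m, rng):
    """One move of the coordinate walk on {1..m}^n, updating h(v) = sum of v_j**2 in O(1).

    v is modified in place; the updated value of h is returned.
    """
    sigma = rng.choice((+1, -1))
    j = rng.randrange(len(v))
    if 1 <= v[j] + sigma <= m:
        # (v_j + sigma)**2 - v_j**2 = 2*sigma*v_j + 1, using v_j before the move.
        h_val += 2 * sigma * v[j] + 1
        v[j] += sigma
    return h_val


rng = random.Random(7)
m = 5
v = [3, 1, 5, 2]
h_val = sum(x * x for x in v)      # one linear-time evaluation at the start
for _ in range(1000):
    h_val = step_with_increment(v, h_val, m, rng)
```

After any number of moves the running value agrees with a from-scratch evaluation, but each step costs constant time rather than time linear in n.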

Of course a spacing t somewhere between 1 and T may be best. To illustrate, suppose we have a
function h which is fairly expensive to compute, requiring about the same cost as about r(1) steps
of the chain. Take an odd t near 2r(1). This makes $(2r(t) - 1) \approx 1.3$.¹ Now if we let $n \approx 1.3k$,
computing $C_n^t$ will take T + 2.6r(1)(k - 1) steps of the chain, but has the additional property that
we need only compute the function h about 1.3k times. The total computational cost will be the
equivalent of about T + 4kr steps of the chain. If r is large, this is considerably less than the
approximately $T + 2r^2 k$ steps needed when successive sampling, and is also less than the (T + r)k
steps needed for nearly independent samples drawn at spacing T, given the reasonable assumption
that T > 3r.

¹ Assuming that $\lambda_1$ is very near 1, we have $2r(2r(1)) - 1 \approx 2/(1 - e^{-2}) - 1 \approx 1.3$.
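The tradeoff among the three spacings can be explored with a rough cost model. Everything in this sketch is an assumption made for illustration: the cost is counted as T steps to reach stationarity, plus t steps between consecutive samples, plus `h_cost` step-equivalents per evaluation of h, with the sample count taken from Corollary 3.4. The numerical values of `lam1`, `T`, and `k` are invented.

```python
import math


def total_cost(lam1, t, k, T, h_cost):
    """Step-equivalents needed to reach variance at most h_2/k with odd spacing t."""
    r_t = 1.0 / (1.0 - lam1 ** t)
    n = math.ceil((2.0 * r_t - 1.0) * k)   # samples required by Corollary 3.4
    return T + t * (n - 1) + h_cost * n


lam1 = 0.99                     # so r(1) is about 100
T = 2001                        # assumed (odd) convergence guarantee
k = 50
h_cost = 100                    # h costs about r(1) steps, as in the illustration above

successive = total_cost(lam1, 1, k, T, h_cost)
intermediate = total_cost(lam1, 201, k, T, h_cost)   # odd t near 2 r(1)
well_spaced = total_cost(lam1, T, k, T, h_cost)
```

With these invented numbers the intermediate spacing is the cheapest of the three, in line with the estimate of about T + 4kr above. With `h_cost` set to 0 the successive-sampling option becomes the cheapest instead.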

In  general  it  seems  that  to  achieve  a  mean  value  estimate  of  given  variance,  when  only  a  bound  on 
$\lambda_1$ is known, it is typically better to take samples spaced at a distance significantly smaller than the
time  required  to  reach  stationarity.  This  is  somewhat  surprising.  The  best  results  in  this  direction 
come  when  the  walk  is  on  a  family  of  (d,  c)-magnifiers.  (For  a  definition,  see  Section  1.5.2.)  We 
explore  the  consequences  in  the  next  chapter. 

Here  we  illustrate  with  some  simpler  examples. 

Example 3.6 (Random Walk on the Hypercube) Let P be the natural aperiodic random walk
on the additive group $(Z_2^d, \oplus)$ (the hypercube). This walk is generated by repeating the following
step: A coordinate j is chosen uniformly from {0, 1, 2, ..., d}. If $v \in Z_2^d$ denotes the current position,
the new position is $v \oplus e_j$ where $e_0 = (0, 0, 0, \ldots, 0)$, and for $j \ne 0$, $e_j$ has a single 1 in the jth
coordinate. The chain is symmetric, and the stationary distribution is uniform on $Z_2^d$. The entire
spectrum of the chain is easily obtained by harmonic analysis [Dia88], or using the construction of the
hypercube as the Cartesian product of 2-cliques (Chapter 2). The eigenvalues are the values $1 - 2j/(d + 1)$
(j = 0, 1, ..., d) appearing with the respective multiplicities $\binom{d}{j}$. A $\lambda_*$ bound alone tells us $T = O(d^2)$
steps are sufficient to get roughly independent samples. Using the full spectrum one can show that
$\tfrac{1}{4} d(\ln d + c)$ steps are sufficient to get samples within total squared variation $\tfrac{1}{2}(\exp(e^{-c}) - 1)$, and
the leading term $\tfrac{1}{4} d \ln d$ is essentially tight. (See [Dia88].) However, 2r(1) = d + 1, so that taking
kd adjacent samples instead of k samples spaced d ln d steps apart gives an estimate with as small a
variance, and saves steps of the chain. This is an artificial example, because in practice we would not
use this walk for sampling from the set of binary d-tuples. However, it illustrates that the theorem
may yield a savings for sampling from the chain, even when a tight full-spectrum bound is known.
□
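The spectrum quoted in this example can be checked by brute force for small d. The sketch below applies one step of the walk to each character $\chi_S(v) = (-1)^{|v \cap S|}$ of $Z_2^d$ and confirms that it is an eigenfunction with eigenvalue 1 - 2|S|/(d + 1); the number of subsets S of size j accounts for the binomial multiplicity. States are encoded as bitmasks and all function names are illustrative.

```python
def apply_walk(f, d):
    """(Pf)(v) for the walk v -> v xor e_j, with j uniform in {0, ..., d} and e_0 = 0."""
    moves = [0] + [1 << i for i in range(d)]
    return [sum(f[v ^ e] for e in moves) / (d + 1) for v in range(1 << d)]


def character(S, d):
    """The character of Z_2^d indexed by the bitmask S."""
    return [(-1) ** bin(v & S).count("1") for v in range(1 << d)]


def check_spectrum(d):
    """True if every character is an eigenfunction with eigenvalue 1 - 2|S|/(d + 1)."""
    for S in range(1 << d):
        f = character(S, d)
        Pf = apply_walk(f, d)
        lam = 1.0 - 2.0 * bin(S).count("1") / (d + 1)
        if any(abs(Pf[v] - lam * f[v]) > 1e-12 for v in range(1 << d)):
            return False
    return True


def two_r_one(d):
    """2 r(1) = 2 / (1 - lambda_1) with lambda_1 = 1 - 2/(d + 1)."""
    return 2.0 / (1.0 - (1.0 - 2.0 / (d + 1)))
```

In particular the second-largest eigenvalue is 1 - 2/(d + 1), which gives the value 2r(1) = d + 1 used in the comparison above.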

Example  3.7  (Permutations  via  Random  Transpositions)  The  following  is  a  well-known  lin¬ 
ear  time  algorithm  for  constructing  a  uniform  random  permutation  of  the  set  [n].  (It  is  attributed 
by  Knuth  [Knu68,  Vol.  2,  Sec.  3.4.2,  p.  140]  to  L.  E.  Moses  and  R.  V.  Oakford  [M063].)  Assume 
that  random(t,  n)  returns  a  random  integer  in  the  set  {i,  i  +  1,  •  •  •»  n}  in  constant  time. 

Random  Permutation 

for 1 ≤ i ≤ n do p[i] ← i

for 1 ≤ i ≤ n − 1 do begin
    j ← random(i, n)
    swap p[i] and p[j]
end

This yields a random permutation in the array $p$ in time $\Theta(n)$. The algorithm uses $\Theta(n \ln n)$ random
bits.
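
As an illustration, the pseudocode above can be transcribed into Python roughly as follows. This is only a sketch: the function name `random_permutation` and the optional `rng` argument are our own choices, and `rng.randint(i, n)` plays the role of random(i, n).

```python
import random

def random_permutation(n, rng=random):
    """Return a uniform random permutation of 1..n as a list."""
    p = list(range(1, n + 1))          # p[i-1] holds the current image of i
    for i in range(1, n):              # i = 1, ..., n-1, as in the pseudocode
        j = rng.randint(i, n)          # random(i, n): uniform on {i, ..., n}
        p[i - 1], p[j - 1] = p[j - 1], p[i - 1]
    return p
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.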




Alternatively, we can generate a random permutation by repeated random transpositions, using
the Markov chain generated by taking moves of the following type: uniformly choose an unordered
pair $\{i, j\} \in [n]^2$ (also allowing $i = j$); then swap $p[i]$ and $p[j]$. From the analysis of this walk
in [DS81] and [Dia88] we know that essentially $\frac{1}{2} n \ln n$ such operations are both sufficient and
necessary to make the permutation nearly uniformly distributed. So it would seem that the random
walk method provides a strictly inferior method of sampling permutations. But this is not necessarily
the case when sampling for mean-value estimation.

From [DS81] we know that $\lambda_1 = 1 - \frac{2}{n}$ (exactly) for this chain, and thus that $2\tau(1) = n$. Thus
starting at an initial random permutation (which we can generate by the traditional method), in
$kn$ steps of the walk we can get a mean-value estimate (based on $kn$ samples) with variance slightly
smaller than the variance of $k$ real uniform samples. This method of using the random walk takes
$\Theta(kn)$ time and $\Theta(kn \ln n)$ random bits, for both resources the same order as required to build the $k$
independent samples. So the random walk method here is competitive. The winning method will be
determined by implementation-dependent constants. In fact, if $h$ is easily computed incrementally,
the random walk method may be more efficient than this traditional algorithm for the problem. □
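
To make the remark about incremental computation concrete, here is a small Python sketch. It rests on assumptions of ours: $h$ is taken to be the number of fixed points of the permutation (a hypothetical choice of test function), positions are 0-based, and a move is made by drawing $i$ and $j$ independently and uniformly. Each step then updates $h$ in constant time instead of rescanning the array.

```python
import random

def fixed_points(p):
    """h(p): number of positions i with p[i] == i (0-based)."""
    return sum(1 for i, v in enumerate(p) if v == i)

def walk_mean_estimate(p, steps, rng=random):
    """Average h over `steps` successive states of the random-transposition walk.

    p is modified in place; h is updated incrementally in O(1) per step.
    Returns the sample mean and the final value of h.
    """
    n = len(p)
    h = fixed_points(p)
    total = 0
    for _ in range(steps):
        i, j = rng.randrange(n), rng.randrange(n)   # i == j is allowed
        if i != j:
            h -= (p[i] == i) + (p[j] == j)          # contributions before the swap
            p[i], p[j] = p[j], p[i]
            h += (p[i] == i) + (p[j] == j)          # contributions after the swap
        total += h
    return total / steps, h
```

Only the two swapped positions can change their fixed-point status, which is why the update touches nothing else.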

Example 3.8 (Random Subsets via Bernoulli-Laplace) A similar situation occurs when generating
random subsets. R. W. Floyd [Flo] has invented the following beautiful algorithm for producing
a subset $S$ of cardinality $m$ from the set $[n]$, distributed uniformly at random over the $\binom{n}{m}$
possibilities. Again, assume random(i, n) returns a random integer in the set $\{i, i + 1, \ldots, n\}$ in
constant time.

Random  Subset 

S ← ∅

for j ← m downto 1 do begin
    r ← random(j, n);
    if r ∉ S then insert r in S else insert j in S
end

Assume that $n$ is even and $m = n/2$ (any value $\Theta(n)$ will do). Then, if we represent the set $S$ using
a bit string (a string of $n$ bits with bit $s$ equal to 1 if $s \in S$, and 0 otherwise), the algorithm runs in
time $O(n)$ and requires $\Theta(n \lg n)$ random bits.
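
The subset algorithm can be sketched in Python as follows. The names `floyd_subset` and `rng` are ours, and a Python `set` stands in for the bit-string representation mentioned above. Note that when step $j$ is reached, every element already in $S$ lies in $\{j + 1, \ldots, n\}$, so inserting $j$ always adds a new element.

```python
import random

def floyd_subset(n, m, rng=random):
    """Return a uniform random m-subset of {1, ..., n}."""
    s = set()
    for j in range(m, 0, -1):          # j = m, m-1, ..., 1
        r = rng.randint(j, n)          # random(j, n): uniform on {j, ..., n}
        if r not in s:
            s.add(r)
        else:
            s.add(j)                   # j is never in s at this point
    return s
```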

The two-urn Bernoulli-Laplace process, for which we gave a coupling analysis in Chapter 2, yields
a nearly-random subset in $\Theta(n \ln n)$ steps. Again, while it seems at first that the random walk
method is strictly inferior, this is not necessarily the case when sampling for mean-value estimation.

For this process we know $\lambda_1 = 1 - \frac{n}{m(n-m)}$ [DS87]. Thus, for $m = n/2$, we have $2\tau(1) = n/2$.

Starting from a uniform random subset (which we can generate by Floyd's algorithm), and successively
sampling over $kn/2$ steps of the walk, we can get an unbiased mean-value estimate with variance
at most $h_2/k$. This means the random walk method may be competitive with Floyd's algorithm.
Both methods take time $O(kn)$ and use $\Theta(kn \ln n)$ random bits. Again, the random walk method
may perform better if $h$ is easily computed incrementally. □

Example 3.9 (Estimation on Random Matchings) Many interesting statistical and combinatorial
sample spaces may be associated with sets of permutations where the image of each position
is restricted to lie within a given set, dependent upon that position. Such spaces are naturally
identified with the set of matchings (1-factors) of a bipartite graph. The counting problem for such
sets is provably hard (#P-complete), so there are good reasons to believe that traditional methods
of exact uniform sampling in this space will not be fruitful. (See [JVV86].) Broder [Bro86] suggested
a Markov chain for near-uniform sampling from this set. Jerrum and Sinclair [JS88] gave a
system of canonical paths for this chain. They showed that when the underlying bipartite graph is
sufficiently dense, their Cheeger-type bound using these paths yielded a bound of $\lambda_1 = 1 - \Omega(n^{-12})$,
implying nearly-independent samples could be obtained at a cost of $O(n^{13} \ln n)$ steps per sample.
Recently J. A. Fill used the same paths in a Poincaré-type bound yielding $\lambda_1 \le 1 - \Omega(n^{-7})$. (This
calculation appears in [DS89].) This implies that nearly independent samples can be obtained at
a cost of $O(n^8 \ln n)$ steps of the chain. For estimation of mean-values on the space, however, this
means that $O(n^7)$ steps per sample will be enough to yield estimates with equally small variance.
This combination of results makes a big difference in the range of $n$ for which any practical means
of accurate estimation is currently known. A substantial improvement also applies to mean-value
estimation using Jerrum and Sinclair's “all-matchings” chain. □

3.3  Indicator  Functions 

We now consider the special case where $h$ is the indicator function of a set $H$ and $\pi$ is the uniform
distribution. Recall that $h$ is called the indicator function of the subset $H$ if $h : V \to \{0, 1\}$ and
$H = \{v \mid h(v) = 1\}$. (Mathematicians would usually refer to these as characteristic functions,
a term which we avoid because it has a different meaning for statisticians.) The mean value of $h$
under $\pi$ is then $h_1 = p = |H|/|V|$. Estimating the mean value of an indicator function turns out to be
the  case  of  principal  interest  in  most  practical  problems;  one  wants  to  estimate  the  proportion  of 
elements  of  a  set  that  satisfy  a  given  criterion.  Indicator  functions  also  arise  when  one  estimates  a 
probability  density  of  a  random  variable  by  sampling.  One  partitions  the  range  of  the  variable  into 
a  number  of  subranges  called  ‘bins.’  Then  one  counts  how  many  samples  fall  into  a  given  bin.  The 
graph  of  the  result  is  called  a  histogram.  Estimating  the  density  in  this  way  may  be  viewed  as 
simultaneously  estimating  the  mean  value  of  the  indicator  function  associated  with  each  bin. 
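
The histogram view can be illustrated with a few lines of Python. This is a sketch under our own conventions: the name `bin_proportions` is hypothetical, and bins are taken to be half-open intervals $[e_b, e_{b+1})$. Each returned value is the sample mean of one bin's indicator function.

```python
def bin_proportions(samples, edges):
    """Estimate the mean of each bin's indicator function from the samples.

    edges = [e0, e1, ..., eb] defines the bins [e0, e1), ..., [e_{b-1}, e_b).
    """
    counts = [0] * (len(edges) - 1)
    for y in samples:
        for b in range(len(counts)):
            if edges[b] <= y < edges[b + 1]:   # indicator of bin b evaluated at y
                counts[b] += 1
                break
    return [c / len(samples) for c in counts]
```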

If we draw samples $Y_1, Y_2, \ldots, Y_n$ independently and uniformly from $V$ then each value $h(Y_i)$ is a
Bernoulli random variable with parameter $h_1 = p$. The single-sample variance is $h_2 = p(1-p) \le 1/4$.
The sum $\sum_{i=1}^{n} h(Y_i)$ is distributed Binomial$(n, p)$, and the following well-known Chernoff-type bound
provides a means of bounding the relative error in the estimator $A_n$ of (3.2).

Lemma 3.10 If $X \sim$ Binomial$(n, p)$, and $q = 1 - p$ then

Pr  {--p>P}  <  e-e'n/ (3.17) 
n 

Pr{—  +  p  <  P}  <  e-'3’n/4M  (3.18) 

n 

Proof: See, for example, [CLR90, Chapter 6]. □

From Theorem 3.2, we know that the distribution of $C_n^t$ for large $t$ must also be approximately
Binomial$(n, p)$. Combining this with Lemma 3.10 above, we get the following theorem.

Theorem 3.11 (Chernoff Bound) Let $\{X_k \mid k \ge 0\}$ be a time-reversible Markov chain with
transition matrix $P$, stationary distribution $\pi$, and convergence guarantee $T_P(\epsilon)$. Let $t \ge T_P(\epsilon/2n)$.
If $h$ is the indicator function of any subset $H \subseteq V$, and

then 

Pr{|Cj  -  p\  >  0}  <  2(1  +  t '**, 
where $p = h_1 = |H|/|V|$ and $q = 1 - p$.

Proof: The result follows by applying the preceding lemma, Theorem 3.2 and the sum rule to each
tail. □

The  bounds  on  the  sample  means  obtained  in  this  way  are  certainly  stronger  than  Chebysheff 
bounds.  However,  for  mean-value  estimation,  they  do  not  represent  a  significant  improvement  over 
the  combination  of  the  Chebysheff  bounds  and  the  median-of-sample-means  method. 

Consider  these  two  alternatives:  (1)  Take  well-spaced  samples  and  compute  the  sample  mean, 
bounding  the  error  using  the  preceding  Chernoff  bounds.  (2)  Take  closely  spaced  samples  and  use 
the  Chebysheff  bounds  of  Corollary  3.5  together  with  the  median-of-sample-means  method  to  reduce 
the probability of error. In either case, $\Theta((pq/\beta^2) \ln(1/\delta))$ samples will be necessary and sufficient
to obtain an estimate within $\beta$ of the true mean with probability at least $1 - \delta$. However, using
the  Chebysheff  and  median  method  we  will  typically  be  able  to  achieve  the  desired  accuracy  with 
substantially  fewer  steps  of  the  chain. 

Let $T = 2\tau(1) \ln |V|$, as in Section 3.2. Once again, $T$ is essentially the leading term in our usual
second-eigenvalue bound on the time to stationarity. From Corollary 3.5 and using the fact that
$h_2 \le 1/4$, we know that $(2\tau(1) - 1)/\beta^2$ successive samples from the stationary chain are sufficient
to guarantee that $\Pr\{|C_n^1 - p| \ge \beta\} \le 1/4$ for any indicator function $h$. Thus essentially

$(T + (2\tau(1) - 1)/\beta^2) \lg(1/\delta)$

steps will allow us to get a median-of-sample-means estimate $M$ that satisfies

$\Pr\{|M - p| \ge \beta\} \le \delta.$

Using well-spaced samples ($t \ge T$) we would need about $T \beta^{-2} \lg(1/\delta)$ steps. So, again, taking
successive samples saves us a factor of about $\ln |V|$ in the number of steps of the chain that we need
to simulate.
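
The final estimation step of the median-of-sample-means method is easy to sketch in Python. The helper below is hypothetical (the name and interface are ours): it assumes the caller has already collected the values $h(X_i)$ in order, with whatever spacing between blocks the analysis requires, and that the number of values is a multiple of the number of groups.

```python
import statistics

def median_of_sample_means(values, groups):
    """Split `values` into `groups` equal consecutive blocks, average each
    block, and return the median of the block averages."""
    size = len(values) // groups
    means = [sum(values[g * size:(g + 1) * size]) / size for g in range(groups)]
    return statistics.median(means)
```

Each block mean is within $\beta$ of the true mean with probability at least $3/4$, and taking the median of $O(\lg(1/\delta))$ such means drives the failure probability down to $\delta$.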

Remark: The other issues discussed in Section 3.2 concerning the cost of computing $h$ and the
possibility of computing $h$ incrementally are pertinent here as well and should be considered when
making a choice of $t$. □


3.4  Central  Limit  Theorem 

In the case of independent samples, the Central Limit Theorem tells us that in the limit as $n$
approaches infinity, the variable

has the normal distribution with mean 0 and variance 1, and tighter non-asymptotic bounds are
possible with appeal to theorems of the Berry-Esseen type [Hal82].

There  are  various  versions  of  the  Central  Limit  Theorem  that  also  hold  for  Markov  chains.  These 
say  that  the  distribution  of  the  sample  mean  of  any  function  on  the  state  space  converges  to  a  normal 
distribution.  In  the  present  environment  we  have  the  following  asymptotic  version. 

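Theorem 3.12 For any $t \ge 1$, the estimator $C_n^t$, with $X_0$ drawn according to $\pi$, is asymptotically
normally distributed with mean $h_1$ and variance $\mathrm{Var}[C_n^t]$. More precisely, for each fixed real $x$

$\lim_{n \to \infty} \Pr\left\{ \frac{C_n^t - h_1}{\sqrt{\mathrm{Var}[C_n^t]}} \le x \right\} = \Phi(x),$

where

$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-v^2/2}\, dv$

is the cumulative distribution of a standard Normal random variable.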

Proof: This is the combination of Theorem 1, p. 99, and Theorem 3 from [Chu60]. □

This suggests approximating the distribution of $C_n^t$ with the Normal distribution with mean $h_1$
and variance $\mathrm{Var}[C_n^t]$. We can combine this idea with our bounds on the variance of $C_n^t$, and use
the fact that a given confidence interval around the mean of a Normal distribution shrinks monotonically
with shrinking variance. This motivates the following approximate confidence intervals. We
emphasise that these are only approximate; they provide some intuition, but not much assurance.




Consider the estimator $C_n^t$ with $t$ odd, and $X_0$ drawn according to $\pi$. For $n$ large and at least
$2\tau(t) - 1$ one has

$\Pr\{|C_n^t - h_1| \ge \beta\sqrt{h_2/n}\}$ is approximately bounded by $2\Phi(-\beta)$.

To illustrate, taking any $t \ge 1$ and $n$ large (at least large enough that $n \ge \max\{2\tau(t) - 1, 660 h_2\}$)
should give $C_n^t = h_1 \pm .01$ with approximately 99% confidence. (The number $660 h_2$ is from [Fel70,
Vol. 1, p. 245].) That is,

$\Pr\{|C_n^t - h_1| \ge .01\}$ is approximately bounded by .01.

Remark:  An  interesting  avenue  for  further  investigation  would  be  to  research  what  is  known  about 
non-asymptotic  versions  of  the  Central  Limit  Theorem  for  Markov  chains.  Likely  there  will  be  a 
spectral  formulation,  and  the  new  eigenvalue  bounds  from  Chapter  1  will  find  additional  applications 
there. 

Certainly,  for  large  sample  spacings  t  one  could  apply  Theorem  3.2  with  known  eigenvalue  bounds 
and  known  non-asymptotic  versions  of  the  central  limit  theorem  for  n  i.i.d.  variables,  (see  [Hal82]), 
to get rigorous non-asymptotic results on the distribution of $C_n^t$. We will not investigate these topics
here. 

Using  the  same  techniques  as  in  Theorem  3.3,  bounds  for  the  higher  moments  of  sample  means 
from  the  stationary  chain  are  also  feasible.  However  a  notable  consequence  of  the  discussion  in 
Section  3.3  is  that  for  the  large  class  of  practical  problems  that  require  mean-value  estimates  for 
indicator functions, it will not yield substantial payoffs to seek higher moment bounds. □


Chapter  4 


Using  Expanders  in  Estimation 


Interesting consequences arise when the variance bounds of the preceding chapter are applied to
random walks on expander graphs.

A family of graphs $\mathcal{G}$ is a family of degree-$d$ $\epsilon$-enlargers if for each $G \in \mathcal{G}$, $G$ has degree bounded
by $d$ and the first positive eigenvalue $\nu_1$ of the graphical Laplacian $Q(G) = D - A$ satisfies $\nu_1 \ge \epsilon$, for
a fixed $\epsilon > 0$ independent of $G$. Alon [Alo86] showed that this is equivalent to guaranteeing that $\mathcal{G}$ is
a family of magnifiers. (In Section 1.5.2, we discussed this result in the Markov chain setting.) Such
expanding families have been studied by several authors, and they have numerous applications.
A number of explicit constructions of infinite families are now known. (See e.g. [GG81] [JM85]
[LPS86] [AGM87].)

Theorem 4.1 (Expander-based Estimates) Let $\mathcal{G}$ be a family of $d$-regular $\epsilon$-enlargers that are
not bipartite. Let $G = (V, E)$ be any graph in $\mathcal{G}$. Let $C_n^t$ be the estimator defined in (3.6) where $P$
is the natural random walk on $G$ and $X_0$ is drawn according to the uniform stationary distribution,
and define

$T(t) = \frac{1}{1 - (1 - \epsilon/d)^t}.$

For any function $h$ on $V$, any odd $t$, and any $k \ge 1$, if $n \ge (2T(t) - 1)k$, then we have

$E[C_n^t] = h_1,$

and

$\mathrm{Var}[C_n^t] \le h_2/k.$

Proof: The natural random walk $P$ on any member of a family of $d$-regular $\epsilon$-enlargers has $\lambda_1(P) \le
1 - \epsilon/d$. This follows immediately from the fact that $L(P) = \frac{1}{d} Q(G)$, and $\lambda_1 = 1 - \beta_1 = 1 - \nu_1/d$.
(See Section 1.5.1.) Thus for this chain the function $\tau(t) = 1/(1 - \lambda_1^t)$ is identical to the function
$T(t)$ stated in the theorem. Combining this with Corollary 3.4 yields the result immediately. □




Note that for all $t$, we have $\tau(t) \le \tau(1) = d/\epsilon$, which is a constant independent of the size of
the vertex set. So roughly speaking, this theorem says that given one element drawn uniformly at
random from $V$, we can get pretty good additional samples (for mean-value estimation) at the cost
of a constant number of steps per sample, where this constant does not depend on $|V|$. Moreover,
the fact that the family is degree-bounded suggests that we may be able to take each step of our
walk using a constant number of random input bits. In this case we will be able to get good estimates
at a random-bit cost much smaller than the cost of independent samples.

In  this  chapter  we  will  show  a  result  of  this  type,  and  how  to  merge  it  with  the  techniques 
in  [AKS87,  IZ89,  CW89]  to  produce  an  algorithm  for  estimating  the  mean  value  of  a  function 
$h : \{0, 1\}^n \to \mathbb{R}$ using very few random bits.

If $h_1$ is the mean and $h_2$ is the variance of $h$ under the uniform distribution, the algorithm
uses $n + O(h_2/\beta^2) + O(\lg(1/\delta))$ independent random bits to produce $O((h_2/\beta^2) \lg(1/\delta))$ samples in
$\{0, 1\}^n$. These samples are used in the standard median-of-sample-means algorithm to obtain an
estimate $M$ such that

$\Pr\{|M - h_1| \ge \beta\} \le \delta.$

The scheme presented here is naturally viewed as a pseudo-random generation scheme for the
median-of-sample-means algorithm, which usually requires $\Theta(n(h_2/\beta^2) \lg(1/\delta))$ independent random bits to
achieve the same error bounds.

The  method  suggested  here  is  ultimately  not  optimal  at  saving  random  bits,  but  this  chapter  is 
intended  to  serve  as  an  example  of  the  application  of  the  results  of  the  last  chapter,  and  the  use  of 
expanders  for  sampling.  The  main  new  result  is  Theorem  4.3. 

Our  algorithm  uses  random  walks  on  expander  graphs  in  a  combination  of  two  separate  stages 
of  pseudo-random  generation.  The  techniques  of  [AKS87]  in  [IZ89]  and  [CW89]  do  not  alone  yield 
savings  as  good  as  we  present  here.  We  discuss  the  relationship  to  these  and  other  similar  results  in 
Section  4.7. 

4.1  Preliminaries 

We  will  assume  the  reader  is  already  familiar  with  the  notation  introduced  in  Chapters  1  and  3. 

Let $V = \{0, 1\}^n$, where $n$ is even. (In practice, if $n$ is odd, one can ‘pad’ the domain $V$ by
one bit to make $n$ even.) Consider a function $h : V \to \mathbb{R}$. The mean value $h_1$ under the uniform
distribution on $V$ is

$h_1 = \frac{1}{|V|} \sum_{v \in V} h(v).$

Define

$h_2 = \frac{1}{|V|} \sum_{v \in V} (h(v) - h_1)^2.$

This is the variance of $h(X)$ when $X$ is an element chosen uniformly at random from $V$. We will
assume throughout that $h_2$ is known, or at least that an upper bound is known. In the remainder
of this chapter $h_2$ may generally be replaced where it appears by such an upper bound. Note that
whenever $h$ is an indicator function (i.e., $h : V \to \{0, 1\}$), we have $h_2 \le 1/4$, since it is equal to the
variance of a Bernoulli random variable.

We will be applying the following magnifier introduced by Alon, Galil, and Milman [AGM87].
The graph has vertex set $Z_m \times Z_m$, for arbitrary nonnegative integer $m$. On this vertex set define
the permutations

$\sigma_0(x, y) = (x, y)$
$\sigma_1(x, y) = (x, y + 2x)$
$\sigma_2(x, y) = (x, y + 2x + 1)$
$\sigma_3(x, y) = (x + 2y, y)$
$\sigma_4(x, y) = (x + 2y + 1, y)$

where all additions are modulo $m$. The graph is obtained by connecting every vertex $v = (x, y)$
to the 9 vertices consisting of $\sigma_i(x, y)$ and the inverse images $\sigma_i^{-1}(x, y)$, $0 \le i \le 4$. (Note two
things: (1) $\sigma_0 = \sigma_0^{-1}$ is the identity map, and (2) we use the term graph in our general sense, as
defined on p. xii.) For each (even) $n$, let $G_n$ be the graph obtained by letting $m = 2^{n/2}$ in the
above. Associate with each vertex $(x, y)$ the binary $n$-tuple corresponding to the concatenation of
the binary representations of $x$ and $y$. In this way, we may view $G_n$ as a magnifier with $2^n$ vertices
labelled with the elements of $V = \{0, 1\}^n$.
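
A short Python sketch may help in seeing how cheaply the edges of this graph can be computed. The function names `neighbors` and `label` are ours, and the bit order used in `label` ($x$ first, then $y$, each padded to $n/2$ bits) is an assumption about how the concatenation is carried out.

```python
def neighbors(x, y, m):
    """The 9 neighbours (listed with multiplicity) of vertex (x, y) in Z_m x Z_m."""
    forward = [
        (x, y),                                 # sigma_0 (identity)
        (x, (y + 2 * x) % m),                   # sigma_1
        (x, (y + 2 * x + 1) % m),               # sigma_2
        ((x + 2 * y) % m, y),                   # sigma_3
        ((x + 2 * y + 1) % m, y),               # sigma_4
    ]
    inverse = [
        (x, (y - 2 * x) % m),                   # sigma_1^{-1}
        (x, (y - 2 * x - 1) % m),               # sigma_2^{-1}
        ((x - 2 * y) % m, y),                   # sigma_3^{-1}
        ((x - 2 * y - 1) % m, y),               # sigma_4^{-1}
    ]
    return forward + inverse

def label(x, y, n):
    """Binary n-tuple for (x, y): n/2 bits of x followed by n/2 bits of y."""
    half = n // 2
    return format(x, f"0{half}b") + format(y, f"0{half}b")
```

Each neighbour costs a constant number of additions modulo $m$, that is, $O(n)$ bit operations on the label of the current vertex.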

For every $n$, the graph $G_n$ has Laplacian eigenvalue $\nu_1 \ge 8 - 5\sqrt{2}$, independent of $n$. (See [JM85]
and [AGM87, p. 341].) This implies that if $P$ is the strongly aperiodic form of the random walk
process on $G_n$, then $\lambda_*(P) = \lambda_1(P) \le 1 - (8 - 5\sqrt{2})/18 < .9485$ for every $n$. (We work here with
the strongly aperiodic form of the random walk on $G$. For background see Section 1.4.2.) Let

t  =  23. 

This value is large enough so that

$\lambda_1^t \le 1/10,$

and for this $t$ we have

$\tau(t) \le 10/9,$

where $\tau(t) = \frac{1}{1 - \lambda_1^t}$ is the same function as that used in Chapter 3.

In addition to $t$, the following parameters are used later. For now, these should simply be
remembered to be constants. The interpretation of these constants will be given when they appear,
but their precise values are not crucial to the arguments.

$c_1 = 2$

$c_2 = 225 = \lceil 8t\tau(t) \rceil$

$c_3 = 736 = 8t\lceil \lg 9 \rceil.$

Remark: The exact choice of magnifier is not crucial for our purposes. However, the following
properties are used and the reader should notice that the graphs $G_n$ defined above have these
properties. The graphs should be connected in order to guarantee that the random walk on the
graph is irreducible. The graphs should be regular, with degree bounded by a constant in $n$. Also,
the graphs should have a compact description so that if we know our current position, the choice
of one of the edges and its traversal in the walk can be simulated by an efficient computation on
the binary $n$-tuple corresponding to the current vertex. Also desirable is that the eigenvalue $\lambda_1$ be
bounded well away from 1, since this increases the efficiency of the method. □


4.2  Outline  of  the  Algorithm 

Our algorithm for mean-value estimation consists of three stages which we first summarise. The first
two stages together produce a set of samples drawn from $V$. The final stage produces the estimate by
using the samples in the basic median-of-sample-means algorithm described in Chapter 3. A precise
description  of  the  new  algorithm  is  given  in  Section  4.6.  However,  it  will  not  be  very  readable 
without  the  background  provided  by  the  intervening  sections.  The  illustration  in  Figure  4.2  on  the 
next  page  may  be  helpful  in  interpreting  the  initial  summary  here. 

In the first stage we use $s + O(\lg(1/\delta))$ random input bits in a random walk on $G_s$, where
$s = n + O(h_2/\beta^2)$. This walk is used to generate $O(\lg(1/\delta))$ (correlated) random strings of length
$s$. These we call sampling seeds.

In the second stage, each of these $O(\lg(1/\delta))$ sampling seeds is used as the random input needed
to specify another walk, but this time on the graph $G_n$. This random walk consists of $O(h_2/\beta^2)$
steps. Recall that the vertex set of $G_n$ is $V = \{0, 1\}^n$, so this is the sample space from which we
want to draw our samples, and we do so during this walk.

Finally, the samples obtained from each sampling seed are used to produce a sample mean, by
evaluating $h$ at each sample and averaging. The final estimate $M$ is the median of the resulting
$O(\lg(1/\delta))$ sample means. This $M$ will lie within $\beta$ of $h_1$ with probability at least $1 - \delta$.


Figure 4.2: Schematic Outline of the Estimation Algorithm. (The input is $s + O(\lg(1/\delta))$ random bits, where $s = n + O(h_2/\beta^2)$.)




4.3  Sample  Means  from  Gn 

The following theorem is one of the two results forming the basis for our algorithm. It shows, roughly
speaking, that in computing the sample mean of any real-valued function $h$ on $V = \{0, 1\}^n$, we
can replace $k$ independent samples from $V$ by $k' = O(k)$ correlated samples which we obtain with a
number of random bits much smaller than $nk$ (the entropy of the $k$ independent samples from $V$).
Moreover, this substitution does not increase the variance of the sample mean.

Theorem 4.3 (Expander Sample Means) Let $c_1$, $c_2$, and $t$ be as defined in Section 4.1. There is
a deterministic algorithm that, given only $n + c_2 k$ independent uniform random bits, outputs $k' \le c_1 k$
random binary $n$-tuples $X_1, X_2, \ldots, X_{k'}$ with the following properties.

If $h$ is any real-valued function on $\{0, 1\}^n$, let $S_X = \frac{1}{k'} \sum_{i=1}^{k'} h(X_i)$ and $S_Y = \frac{1}{k} \sum_{i=1}^{k} h(Y_i)$,
where $Y_1, Y_2, \ldots, Y_k$ are independent uniform binary $n$-tuples.

Then for every $h$ we have

$E[S_X] = E[S_Y] = h_1$

and

$\mathrm{Var}[S_X] \le \mathrm{Var}[S_Y] = h_2/k.$


Proof: We generate the samples $X_i$ using the strongly aperiodic random walk process $P$ on the
graph $G_n$ above, using the random bits to make the necessary choices. Use the initial $n$ bits as a
binary $n$-tuple specifying the initial vertex of the walk. This starts us in the uniform stationary
distribution. We draw the samples $X_i$ every $t$ steps apart from this stationary Markov chain. Now
Theorem 3.3 tells us that the sample mean based on $k' = \lceil (2\tau(t) - 1)k \rceil \le c_1 k$ such samples has the
stated properties. We can perform the strongly aperiodic walk on $G_n$ using only $4 = \lceil \lg 9 \rceil$ random
bits per step, so $n + 4t\lceil (2\tau(t) - 1)k \rceil \le n + c_2 k$ random bits are sufficient to take this walk. This
gives the result. □

Remark: The constants $c_1$ and $c_2$ depend on the choice of $t$ set in Section 4.1. Larger $t$ will bring
$c_1$ close to 1, but make $c_2$ larger.

Note that only a constant number of random bits are required to take each step of the walk
on $G_n$, but each step requires $\Theta(n)$ time in general. If random bits are available at unit cost, the
time complexity of using $nk$ random bits or taking $k' = O(k)$ steps of the walk is the same, $O(nk)$.
Here we are concerned with saving random bits, and incurring only small additional time costs. □

Corollary 4.4 There is a deterministic algorithm that, given $\beta > 0$ and $n + O(h_2/\beta^2)$ uniform
independent random bits, outputs an estimate $S$ satisfying

$\Pr\{|S - h_1| \ge \beta\} \le \frac{1}{100}.$

The algorithm runs in time that is linear in $n$, $h_2$, and the cost of computing $h$, and quadratic in
$1/\beta$.

Proof: The estimate $S$ is simply the sample mean $S_X$ based on the samples obtained by the
preceding theorem. The inequality is Chebysheff's. The running time bound is clear. □

To get an estimate that is within an interval of width $\beta$ around $h_1$ with probability at least $1 - \delta$,
we could now simply apply Lemma 3.1 and take the median of $O(\lg(1/\delta))$ independent estimates
$S$ of this type. However, this would use $O((n + h_2/\beta^2) \lg(1/\delta))$ random bits, and we don't want
to incur that cost. The results of the next section suggest how we can get away with using only
$n + O(h_2/\beta^2) + O(\lg(1/\delta))$ random bits.

4.4  Majorities  from  Gn 

We have already seen that $(2\tau(t) - 1)k$ samples drawn a constant $t$ steps apart from $G_n$ have
approximation properties similar to $k$ i.i.d. samples in the sense that sample means based on $O(k)$
samples drawn from $G_n$ have variance matching $k$ i.i.d. samples. We now show how the same type
of samples also have properties similar to i.i.d. samples under certain majority tests.

The  results  of  this  section  are  really  those  of  [AKS87]  [IZ89]  and  [CW89]  translated  slightly.  We 
will  show  that  if  we  consider  any  small  subset  B  of  the  vertices,  not  too  many  of  the  samples  we 
draw  will  lie  in  B.  This  provides  the  second  key  ingredient  for  the  algorithm. 

Define the projection matrix $N$ of a subset $B \subseteq V$ as the $|V| \times |V|$ matrix indexed by the
elements of $V$ such that $N_{vv} = 1$ if $v \in B$ and all other entries of $N$ are zero.

For vectors $\phi \in \mathbb{R}^{|V|}$, let $|\phi|_2 = (\sum_{v \in V} \phi^2(v))^{1/2}$ denote the $L^2$ norm, and let $|\phi|_1 = \sum_{v \in V} |\phi(v)|$
denote the $L^1$ norm.

We  begin  with  the  following  lemma. 

Lemma 4.5 (From [IZ89]) Let $P$ be the strongly aperiodic walk on $G_n$, and let $t$ be as defined in
Section 4.1. Let $B \subseteq V$ be a subset of the vertices such that $|B|/|V| \le 1/100$. Let $N$ denote the
projection matrix of the set $B$, and $M$ denote the projection matrix of the set $\bar{B}$. For $\phi \in \mathbb{R}^{|V|}$ we
have

(i) $|\phi P^t M|_2 \le |\phi|_2$;

(ii) $|\phi P^t N|_2 \le \frac{1}{5} |\phi|_2$.

Proof: The eigenvalues of $P^t$ all have absolute value at most 1, so $|\phi P^t|_2 \le |\phi|_2$. Since $M$ is a
projection matrix, for any vector $\phi$ we have $|\phi M|_2 \le |\phi|_2$, because multiplying by $M$ effectively just
sets some components of $\phi$ to 0 and does not increase any other components. Combining these two
facts proves (i). To prove (ii), first write $\phi = v + w$ where $v$ is a scalar multiple of the stationary
eigenvector $(1, 1, \ldots, 1)$ and $w$ is orthogonal to $v$. Now note

$|\phi P^t N|_2 = |v P^t N + w P^t N|_2 \le |v P^t N|_2 + |w P^t N|_2,$

by linearity and the triangle inequality. Now, working with the right hand side, the first term satisfies

$|v P^t N|_2 = |v N|_2 \le \frac{1}{10} |v|_2 \le \frac{1}{10} |\phi|_2.$

The first equality is because $v P^t = v$. The second is because $|B| \le \frac{1}{100} |V|$ and $v$ is parallel to
$(1, 1, \ldots, 1)$. Now to bound the $w$ component, we have

$|w P^t N|_2 \le |w P^t|_2 \le \lambda_1^t |w|_2 \le \frac{1}{10} |w|_2 \le \frac{1}{10} |\phi|_2,$

since by choice of $t$, we have $\lambda_1^t \le \frac{1}{10}$. Thus $|\phi P^t N|_2 \le \frac{1}{5} |\phi|_2$. □

Theorem 4.6 (Expander Sample Majorities) [IZ89] Let $P$ be the strongly aperiodic walk on $G_n$,
started in its uniform stationary distribution $\pi$, and let $t$ be as defined in Section 4.1. Let $B \subseteq V$ be
any subset of the vertices such that $|B|/|V| \le 1/100$. If $X_1, X_2, X_3, \ldots, X_{8k}$ are $8k$ samples drawn $t$
steps apart from the walk $P$, then the probability that at least $4k$ (half) of these samples $X_i$ lie in $B$
is at most $2^{-k}$.

Proof: Notice that, since we start in the uniform stationary distribution $\pi$, the probability of
getting a given sequence of results in $B$ and out of $B$, say "Out, In, Out, Out", denoted $\bar{B}B\bar{B}\bar{B}$, is
given by

$$\Pr\{\bar{B}B\bar{B}\bar{B}\} = |\pi P^t M P^t N P^t M P^t M|_1.$$

To bound $\ell_1$ norms, we use the lemma above in combination with the fact that, for $\phi \in \mathbb{R}^{|V|}$,
we have $|\phi|_1 \le \sqrt{|V|}\,|\phi|_2$, which is obtained by applying the Cauchy-Schwarz inequality.

Call a sequence $\sigma$ of $8k$ samples 'bad' if at least $4k$ of the samples lie in $B$. We can now bound

$$\Pr\{\text{obtaining a specific 'bad' } \sigma\} \le \sqrt{|V|}\,|\pi (P^t N)^{4k}|_2 \le \sqrt{|V|}\; 5^{-4k}\, |\pi|_2 = 5^{-4k},$$

when starting the walk in the uniform (stationary) distribution $\pi(v) = \frac{1}{|V|}$ for all $v \in V$.

Because there are at most $2^{8k}$ sequences of results $\sigma$ (both 'bad' and otherwise), we have

$$\Pr\{\text{obtaining some 'bad' } \sigma\} \le 2^{8k}\, 5^{-4k} = (256/625)^k < 2^{-k},$$

which is the desired result. □
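
The arithmetic in the last display can be checked exactly. The following is a small sketch in Python using exact rationals; the helper name `bad_majority_bound` is ours and does not appear in the thesis.

```python
from fractions import Fraction

def bad_majority_bound(k):
    """Union bound 2^(8k) * 5^(-4k) over all outcome sequences of 8k samples."""
    return Fraction(2 ** (8 * k), 5 ** (4 * k))

for k in range(1, 6):
    bound = bad_majority_bound(k)
    # The bound equals (256/625)^k and stays strictly below 2^(-k).
    assert bound == Fraction(256, 625) ** k
    assert bound < Fraction(1, 2 ** k)
    print(k, float(bound), 2.0 ** -k)
```

Since 256/625 is a little under 0.41, the gap to $2^{-k}$ widens geometrically as $k$ grows.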

This  gives  us  the  following  corollary  by  the  same  reasoning  as  before. 




Corollary 4.7 [IZ89] There is a constant $c_3$ (specified in Section 4.1), such that there is a
deterministic algorithm that, given only $n + c_3 k$ uniform independent random bits, outputs $8k$ random
binary $n$-tuples $X_1, X_2, X_3, \ldots, X_{8k}$ with the following property.

If $B$ is any subset of $V = \{0, 1\}^n$ such that $|B|/|V| \le 1/100$, then the probability that at least $4k$
of the samples $X_i$ lie in $B$ is at most $2^{-k}$.

Proof: Apply the preceding theorem. The constant $c_3$ is a bound on the number of random bits
required to take $8t$ steps of the walk. □

Remark: As before, a similar result holds on any similar magnifier. Different constants can be
achieved.

The material in this section was adapted from [IZ89]. If $B$ is taken to be the set of incorrect
witnesses in a BPP algorithm, the above corollary is precisely the expander-based result in that paper.
Equivalent results are found in [CW89]. Both papers use essentially the techniques in [AKS87], while
significantly clarifying both the presentation and the importance of the results. □

Corollary 4.8 Suppose we have an algorithm $S$ to approximate some value $\alpha$ which, when given a
word $w$ of $s$ i.i.d. uniform random bits, produces an estimate $S(w)$ of $\alpha$ such that

$$\Pr\{|S(w) - \alpha| > \beta\} \le 1/100.$$

Then there is an approximation algorithm which, given $s + O(\lg(1/\delta))$ random bits for any real $\delta$,
produces an estimate $M$ such that

$$\Pr\{|M - \alpha| > \beta\} \le \delta.$$

Proof: Let $S(w)$ be the function computed by the algorithm on random input $w \in W = \{0, 1\}^s$.
We are guaranteed that when $w$ is chosen uniformly from $W$, $\Pr\{|S(w) - \alpha| > \beta\} \le 1/100$. It
follows that if we consider the subset $B \subseteq W$ given by

$$B = \{w : |S(w) - \alpha| > \beta\}$$

then $|B|/|W| \le 1/100$.

So suppose now that we generate $8k$ random elements $w_1, w_2, \ldots, w_{8k}$ of $W$ using the method in
Corollary 4.7 with the expander $G_s$. This requires $s + c_3 k$ random bits, and we are guaranteed that
the probability that at least half ($4k$) of the samples $w_i$ lie in $B$ is at most $2^{-k}$.

It follows that the median $M$ of the $8k$ values $S(w_i)$ must lie in the interval $[\alpha - \beta, \alpha + \beta]$
with probability at least $1 - 2^{-k}$. For, if the median lies outside a given interval, at least half of the
$S(w_i)$ do, and so by definition of $B$, at least half of the $w_i$ are in $B$. (This is the same reasoning
as in Lemma 3.1.) Conclude that if $k \ge \lg(1/\delta)$, the probability that the median lies within the
interval $[\alpha - \beta, \alpha + \beta]$ is at least $1 - \delta$. □

Remark: This is a constructive version of the BPP result of [IZ89]. □




4.5 Combining the Results

Combining  the  results  on  sample  means  and  sample  majorities  from  expanders  gives  us  the  following 
theorem. 

Theorem 4.9 There are constants $c_2$ and $c_3$ (specified in Section 4.1), such that there is a
deterministic algorithm that when given any function $h : \{0, 1\}^n \to \mathbb{R}$ takes

$$n + c_2 \lceil 100 h_2/\beta^2 \rceil + c_3 \lceil \lg(1/\delta) \rceil$$

random bits and produces an estimate $M$ satisfying

$$\Pr\{|M - h_\pi| > \beta\} \le \delta.$$

The running time of the algorithm is polynomial in $n$, $1/\beta$, $\lg(1/\delta)$, $h_2$, and the cost of computing
$h$.

Proof: Let $S(w)$ be the estimate computed by the algorithm given by Corollary 4.4, whose input
$w$ is the string of $s = n + c_2 \lceil 100 h_2/\beta^2 \rceil$ bits. That corollary guarantees us that when $w$ is chosen
uniformly at random from $W = \{0, 1\}^s$, $|S(w) - h_\pi| > \beta$ with probability at most 1/100.

The result follows from Corollary 4.8. □
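
To get a feel for the bit count in the theorem, the expression can be evaluated directly. The sketch below is illustrative only: the constants $c_2$ and $c_3$ are fixed in Section 4.1, which is not reproduced here, so they are passed in as parameters, and the argument `h2` stands for the quantity written $h_2$ in the theorem.

```python
import math

def random_bits_needed(n, h2, beta, delta, c2, c3):
    """Evaluate n + c2*ceil(100*h2/beta^2) + c3*ceil(lg(1/delta))."""
    # s is the length of each sampling seed used by the inner estimator.
    s = n + c2 * math.ceil(100 * h2 / beta ** 2)
    # The outer majority step adds c3 bits per unit of lg(1/delta).
    return s + c3 * math.ceil(math.log2(1 / delta))

# Hypothetical values c2 = c3 = 1, chosen only to show the arithmetic.
print(random_bits_needed(n=10, h2=1.0, beta=0.5, delta=0.25, c2=1, c3=1))
```

Note that tightening $\delta$ only moves the last term, which is the point of combining the two results.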

4.6  The  Implied  Algorithm 

The reader should already have understood the algorithm implied by the preceding results. However,
we specify it in Figure 4.10 for completeness.

The description makes evident a decomposition of the algorithm into two phases: one in which
all samples are produced and another in which these samples are used to produce the estimate. This
decomposition places the algorithm into a class of approximation algorithms considered in [BGG90,
Section 3]. Also it shows that the results here can be considered to provide a pseudo-random
generator for the usual median-of-sample-means algorithm. We discuss this in the next section.
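
For orientation, here is a minimal sketch of the usual median-of-sample-means algorithm itself, driven by independent uniform samples from Python's own generator rather than by the expander walks of this chapter. The group sizes mirror those in the text; the function name, the encoding of an $n$-bit string as an integer, and the parameter `h2` (standing for the quantity written $h_2$) are assumptions made for illustration.

```python
import math
import random
import statistics

def median_of_sample_means(h, n, h2, beta, delta, rng=random):
    """Estimate the mean of h over {0,1}^n from independent uniform samples."""
    groups = 8 * math.ceil(math.log2(1 / delta))   # number of sample means
    per_group = math.ceil(100 * h2 / beta ** 2)    # samples behind each mean
    means = []
    for _ in range(groups):
        # Each sample is an n-bit string, encoded here as an integer.
        total = sum(h(rng.getrandbits(n)) for _ in range(per_group))
        means.append(total / per_group)
    # A majority of the means must be bad for the median to be bad.
    return statistics.median(means)

# Example: h is the lowest bit of the sample, whose true mean is 1/2.
print(median_of_sample_means(lambda x: x & 1, n=16, h2=1.0, beta=0.2,
                             delta=0.01, rng=random.Random(0)))
```

This version consumes fresh random bits for every sample; the construction in this chapter replaces both sources of independence with walks on expanders.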

4.7  Discussion 

Informally, a pseudo-random generator is an algorithm that, given a small uniform random
input "seed", generates a larger (correlated) sequence of random outputs, which are "essentially as
good as" true uniform i.i.d. values. A cryptographically secure pseudo-random generator would
generate outputs which were indistinguishable from true uniform i.i.d. values by all polynomial time
computations. The existence of such generators is closely linked to the existence of certain "one-way"
functions, whose existence, in turn, is a hard and well-known open problem. (See [ILL89].) Therefore
researchers have concentrated on proving that certain pseudo-random generators are good for certain
specific applications. To this end, for example, Bach [Bac87] and Karloff and Raghavan [KR88] have
shown that certain linear congruential generators are good in specific algorithms, such as taking
square roots modulo a prime and Quicksort.

We have shown an algorithm that produces samples in $\{0, 1\}^n$ by taking every $t$-th state of a
stationary random walk on $G_n$, and have demonstrated that this algorithm can be used in a certain
precise sense as a pseudo-random generator for computing sample means (see Theorem 4.3).

The  authors  of  [IZ89]  (also  [CW89])  proved  the  results  of  Section  4.4  showing  that  the  same 
pseudo-random  generator  could  be  used  in  computing  certain  sample  majorities.  Their  purpose  was 
to  show  that  the  generator  could  be  used  to  rapidly  decrease  the  error  probabilities  of  any  BPP 
algorithm.  These  techniques  do  not  seem  able  to  give  Theorem  4.3  directly. 

One can use either the results of Section 4.3 or of Section 4.4 alone with the median-of-sample-means
algorithm. But neither does as well as the combination. Applying Theorem 4.3 alone with the
median-of-sample-means algorithm, one can obtain an estimate $M$ within $\beta$ of $h_\pi$ with probability $1 - \delta$ using
$O((n + h_2/\beta^2) \lg(1/\delta))$ independent random bits. Similarly, applying Corollary 4.7 alone with the
median-of-sample-means algorithm one can get away with using $n + O((h_2/\beta^2) \lg(1/\delta))$ independent
random bits, but not fewer. However we have shown that the same generator (with different sizes
of $G_n$) can be used twice in succession to yield an extra savings so that only $n + O(h_2/\beta^2) + O(\lg(1/\delta))$
bits are used.

We got this last idea from the recent paper [BGG90], in which the authors show how to combine
the expander-based pseudo-random generator presented here with certain pairwise-independence
constructions. Their construction gets even better results. For boolean functions $h : \{0, 1\}^n \to \{0, 1\}$
they give a pseudo-random generator for the same median-of-sample-means algorithm that produces
$\Theta((1/\beta^2) \lg(1/\delta))$ samples using $2n + O(\lg(1/\delta))$ random bits. This works because they show that
using only $2n$ bits, they can generate samples whose sample mean is within $\beta$ of $h_\pi$ with probability
bounded away from zero. Then by Corollary 4.8, the result follows.

Notice that their random bit requirement is independent of $\beta$. They assume (critically) that
$\beta > 2^{-n/2}$ in order that the pairwise independence construction be able to produce enough samples.
This is a reasonable assumption, since otherwise the number of samples produced by this method
would be comparable to the size of the sample space, the total running time of the method would
be exponential, and it would not make sense to use this method over a simple exhaustive exact
computation of $h_\pi$.

An examination of their proof reveals that their method also works with arbitrary functions
$h : \{0, 1\}^n \to \mathbb{R}$, under the slightly stronger assumption that $\beta^2/h_2 > 2^{-n}$, which is reasonable for
the same reason. (Indeed, it is reasonable to assume that $h_2/\beta^2$ is bounded by a polynomial.)

Any approximation scheme of the form: (I) produce several samples (II) estimate mean using the
samples, which works for every boolean function $h$ in polynomial time, must produce $\Omega(\beta^{-2} \lg(1/\delta))$
samples. Every such scheme that produces this many sample points requires $\Omega(n + \lg(1/\delta))$ random
bits [BGG90]. Thus one cannot do better in the same framework.




Algorithm 4.10 Random-Bit Efficient Estimation The setup described in Section 4.1 is
assumed. We assume also that we have an oracle computing the function $h$. This algorithm takes

$$n + c_2 \lceil 100 h_2/\beta^2 \rceil + c_3 \lceil \lg(1/\delta) \rceil$$

random bits and outputs an estimate $M$ of the mean value $h_\pi$ of $h$ satisfying $\Pr\{|M - h_\pi| > \beta\} \le \delta$.

Phase I. (Produce Samples)

Stage 1. (Produce Sampling Seeds)

• Use $s = n + c_2 \lceil 100 h_2/\beta^2 \rceil$ input random bits to specify an initial vertex of a random
walk on $G_s$ whose vertex set is $W = \{0, 1\}^s$. This starts us in the uniform stationary
distribution of the walk. The $c_3 \lceil \lg(1/\delta) \rceil$ additional input random bits that we have are
sufficient to take all of the steps of the following walk.

• for $1 \le i \le 8 \lceil \lg(1/\delta) \rceil$ do begin

    • Take $t$ steps of the walk on $G_s$. After the $t$-th step, let (sampling seed) $w_i$ be the
    current vertex, a string of $s$ bits.

Stage 2. (Produce Samples from Sampling Seeds)

• for $1 \le i \le 8 \lceil \lg(1/\delta) \rceil$ do

    • Use the first $n$ bits of $w_i$ to specify an initial vertex of a random walk on $G_n$ whose
    vertex set is $V = \{0, 1\}^n$. This starts us in the uniform stationary distribution of the
    walk. The $s - n = c_2 \lceil 100 h_2/\beta^2 \rceil$ remaining bits of $w_i$ are sufficient to take all of the
    steps of the following walk.

    • for $1 \le j \le \lceil 100 h_2/\beta^2 \rceil$ do

        • Take $t$ steps of the walk on $G_n$. After the $t$-th step, let the sample $X_{ij}$ be the
        current vertex, a string of $n$ bits.

end

Phase II. (Produce Estimate from Samples)

• for $1 \le i \le 8 \lceil \lg(1/\delta) \rceil$ do

    • Let $a_i = \bigl(\sum_{1 \le j \le \lceil 100 h_2/\beta^2 \rceil} h(X_{ij})\bigr) / \lceil 100 h_2/\beta^2 \rceil$ be the sample mean of the $h(X_{ij})$ for the
    given $i$.

• Let $M$ be the median of the $a_i$. Return the estimate $M$.


Figure  4.10:  Random-Bit  Efficient  Estimation 


Chapter  5 

Estimating  the  Significance  of 
Contingency  Tables 


A two-way contingency table is a tabular description of the sizes of intersections of two partitions
of a set of $N$ elements. Given two such partitions, $\mathcal{R} = R_1, R_2, \ldots, R_m$ and $\mathcal{C} = C_1, C_2, \ldots, C_n$, we
get an $m \times n$ table $T$ by letting

$$T_{ij} = |R_i \cap C_j|.$$

These  tables  arise  in  statistical  situations  where  one  wants  to  investigate  possible  correlations 
between  two  partitions  of  interest.  For  example,  the  object  of  a  certain  medical  study  may  be  to 
investigate  relations  between  smoking  and  the  incidence  of  heart  attacks.  A  sample  is  taken  from  a 
set  of  subjects.  The  subjects  are  classified  as  non-smokers,  light  smokers,  and  heavy  smokers  by  the 
number  of  cigarettes  consumed  daily.  The  subjects  are  also  classified  as  having  had  0,  1,  or  more 
than  1  heart  attack.  This  may  result  in  a  table  like  that  in  Figure  5.1.  (Note:  the  data  in  that  table 
are  entirely  fictitious.) 


                 heart attacks
smoking        0       1    more
none         621       3      16
light         84      25       3
heavy         64      37       9

Figure  5.1:  a  3  x  3  contingency  table 

Viewing  the  table,  there  is  an  apparent  correlation  between  heavier  smoking  and  heart  attacks. 
But  is  this  indeed  the  case?  Might  we  expect  to  see  a  similar  table  even  if  the  two  were  independent? 
One way of testing for independence is to perform a statistical test using a measure of independence
called the chi-squared ($\chi^2$) statistic.

Let $r_i = |R_i| = \sum_j T_{ij}$ and $c_j = |C_j| = \sum_i T_{ij}$. Suppose we adopt a so-called null hypothesis
that the table $T$ had been produced simply by sampling subjects with replacement from a
population with a fraction $r_i/N$ falling in category $R_i$ and a fraction $c_j/N$ falling into category $C_j$,
with the two classifications being completely independent. This is called the multinomial model.
(For additional background see [BFH75, Eve77, MGB74].) Then having observed the table $T$, the
maximum likelihood estimate for $|R_i \cap C_j|$ is $\hat{T}_{ij} = r_i c_j / N$. This is also the expected value of
$T_{ij}$ if it were generated at random under the multinomial model. The statistic $\chi^2(T)$ measures the
distance of the observed table $T$ from this outcome; it is defined

$$\chi^2(T) = \sum_{i,j} \frac{(T_{ij} - \hat{T}_{ij})^2}{\hat{T}_{ij}}.$$

We define the multinomial significance of the observed table $T$ as $\Pr\{\chi^2(T') < \chi^2(T)\}$ when
$T'$ is drawn under the hypothesized multinomial model. The greater this value, the more unlikely
it is that we would see a table as "skewed" as $T$, had the table simply been generated at random
under the multinomial model. We reject the hypothesis that the table was produced at random
under this model if the significance is larger than a certain threshold, typically .90 or .95. We might
then conclude that $T$ in fact shows some relation between the partitions $\mathcal{R}$ and $\mathcal{C}$. This is called
the exact multinomial $\chi^2$ test. For large $N$, under the multinomial model, the distribution of
$\chi^2(T)$ approximately follows a distribution called 'the $\chi^2$ distribution with $(m - 1)(n - 1)$ degrees
of freedom' [MGB74]. The standard $\chi^2$ test uses this approximation to estimate the multinomial
significance.
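
As a concrete illustration, the chi-squared statistic of a table can be computed directly from its row and column sums. The sketch below is ours, not code from the thesis; it takes the expected count in cell $(i, j)$ to be $r_i c_j / N$ and applies the statistic to the fictitious table of Figure 5.1.

```python
def chi_squared(table):
    """Pearson chi-squared statistic of a two-way table with positive margins."""
    rows = [sum(row) for row in table]          # r_i
    cols = [sum(col) for col in zip(*table)]    # c_j
    total = sum(rows)                           # N
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

figure_5_1 = [[621, 3, 16], [84, 25, 3], [64, 37, 9]]
degrees_of_freedom = (len(figure_5_1) - 1) * (len(figure_5_1[0]) - 1)
print(chi_squared(figure_5_1), degrees_of_freedom)
```

A table whose rows are exactly proportional to one another has statistic 0, which gives a quick sanity check on any implementation.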

Recently there has been interest in testing for significance against a different null hypothesis,
one in which the underlying model is uniform. In this model all tables with the same row and
column sums arise with equal probability. Again, the significance value of the observed table $T$ is
$\Pr\{\chi^2(T') < \chi^2(T)\}$, but here drawing $T'$ uniformly from the set of all tables with the same row
and column sums. Again, we reject this null hypothesis if the significance value is large; this test is
called a conditional volume test. (Diaconis and Efron [DE85] give several reasons why this test
is meaningful. The curious reader is referred to that article. The relevant statistics goes well outside
the scope of this work.)

Statisticians also employ tests based on the significance in other underlying models. Several
algorithms have been suggested for estimating and exactly calculating significance levels for
the $\chi^2$ test under the multinomial, the Fisher-Yates (hypergeometric), and other models (e.g.,
see [PTH81, MPS88, BOP88]). It seems significantly harder to compute significance values under
the uniform model. There do not seem to be any non-exhaustive methods known to compute
the exact significance. Moreover, it seems that one should not seek a randomized method that relies
on exact uniform sampling. We say this because computing the number of tables with the same
row and column sums as $T$ is a well-known and difficult combinatorial problem and most traditional
methods of exact sampling would rely on methods of computing this number exactly. (Some
combinatorial background material appears in Appendix C. Two good references are [GC77] and [Sta86,
Chapter 4, esp. p. 232]. A provable relationship between the complexity of sampling and
approximate counting in general is described in [JVV86]. We explain the basic idea behind approximate
counting using sampling in Appendix C.)

In this chapter we present a randomized algorithm for estimating the significance of tables under
the uniform model. Our method is based on sampling using a random walk. We can prove that
the walk converges to the desired uniform distribution, so that asymptotically the method must
yield good estimates, but we cannot currently give good (polynomial) time bounds on the rate of
convergence. However, we conjecture some bounds and show that empirically the method seems to
perform well under the conjectured bounds.

Let us now fix some notation. Let $r = (r_1, r_2, \ldots, r_m)$ and $c = (c_1, c_2, \ldots, c_n)$ denote nonnegative
integer partitions of $N$, with $r_1 \ge r_2 \ge \cdots \ge r_m > 0$ and $c_1 \ge c_2 \ge \cdots \ge c_n > 0$. Let $\Sigma_{rc}$ denote
the set of all $m \times n$ nonnegative integer matrices in which row $i$ has sum $r_i$ and column $j$ has sum
$c_j$. (We have permuted the rows and columns so that the sums are in non-increasing order. The
properties of the tables that concern us, the cardinality $|\Sigma_{rc}|$ and the $\chi^2$ statistic, do not depend
on the order of the rows and columns.) Note that we have $\sum_{i,j} T_{ij} = N$ for every $T \in \Sigma_{rc}$, and also
that $m$ and $n$, the dimensions of the tables, are not explicitly present in the notation $\Sigma_{rc}$ but are
understood in context from the lengths of $r$ and $c$. This use of $m$, $n$, and $N$ will persist throughout
the chapter.

The set $\Sigma_{rc}$ is always nonempty. This may not be evident immediately, but an algorithm for
constructing an element is described briefly in Section C.1.4 of Appendix C. We will also assume
that $m > 1$ and $n > 1$; otherwise $\Sigma_{rc}$ has only one element.

In Section 5.1, we present a random walk on the set $\Sigma_{rc}$. We can prove that the walk converges
to the uniform distribution on $\Sigma_{rc}$. However, we cannot currently prove polynomial time bounds
for the time to reach stationarity.

In Section 5.4, we present the results of some empirical studies involving the algorithm. We
compare results obtained by our randomized method with those obtained by exact methods (when
they are feasible). We also compare our results with asymptotic approximations obtained by some
analytic methods.

In Section 5.5 we discuss how recent results by some other researchers may be applied to prove
polynomial time bounds for a variant of our scheme. The polynomials in the bounds, however, are
too large to offer both rigorous and practical utility.




5.1 A Random Walk on $\Sigma_{rc}$

In this section we suggest a random walk on $\Sigma_{rc}$ that can be used for near-uniform sampling. We
prove that the walk converges to the uniform distribution on $\Sigma_{rc}$, but we do not prove a polynomial
convergence rate bound for this walk.

The walk can start anywhere in $\Sigma_{rc}$. In practice one has 'observed' a table in $\Sigma_{rc}$ as the
result of some statistical study, and this can be used as an initial element at which to start the
walk. Alternatively, it is not hard to construct a table by the algorithm described in Appendix C,
Section C.1.4. After an initial element is obtained, each step of the walk is generated by performing
the following operations.

Algorithm 5.2 [Basic random walk on $\Sigma_{rc}$]. Procedure to take a single step of the walk on $\Sigma_{rc}$.

1. Let $X \in \Sigma_{rc}$ be the current position of the walk.

2. Choose a pair of rows $i_1$ and $i_2$, with $1 \le i_1 < i_2 \le m$, uniformly from among all $\binom{m}{2}$ such
choices.

3. Choose a pair of columns $j_1$ and $j_2$, with $1 \le j_1 < j_2 \le n$, uniformly from among all $\binom{n}{2}$ such
choices.

4. Choose $d$ uniformly from the set $\{+1, -1\}$.

5. Let $Y$ be the $m \times n$ matrix obtained from $X$ by taking

$$Y_{i_1 j_1} = X_{i_1 j_1} + d \qquad Y_{i_1 j_2} = X_{i_1 j_2} - d$$
$$Y_{i_2 j_1} = X_{i_2 j_1} - d \qquad Y_{i_2 j_2} = X_{i_2 j_2} + d$$

and $Y_{ij} = X_{ij}$ for all other $(i, j) \in [m] \times [n]$. Notice that $Y$ has the same row and column
sums as $X$, and is a matrix of integers.

6. (Take the Step) If all entries of $Y$ are nonnegative, then move to $Y$; our new current position
is $Y$. Otherwise remain at $X$, i.e. our new current position remains $X$. Note: The entries
of $Y$ will in fact be nonnegative, unless $X$ has a zero in at least one of the two entries from
which we are subtracting 1.
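
The six steps above translate almost line for line into code. The following is a minimal sketch, assuming the table is stored as a list of lists of nonnegative integers; the function name and the optional `rng` argument are ours.

```python
import random

def basic_walk_step(X, rng=random):
    """Take one step of the basic walk from table X and return the new table."""
    m, n = len(X), len(X[0])
    i1, i2 = sorted(rng.sample(range(m), 2))   # step 2: uniform pair of rows
    j1, j2 = sorted(rng.sample(range(n), 2))   # step 3: uniform pair of columns
    d = rng.choice((1, -1))                    # step 4: sign of the move
    Y = [row[:] for row in X]                  # step 5: build the candidate
    Y[i1][j1] += d
    Y[i1][j2] -= d
    Y[i2][j1] -= d
    Y[i2][j2] += d
    corners = (Y[i1][j1], Y[i1][j2], Y[i2][j1], Y[i2][j2])
    # Step 6: move only if the candidate is still a nonnegative table.
    return Y if min(corners) >= 0 else X

table = [[621, 3, 16], [84, 25, 3], [64, 37, 9]]
for _ in range(1000):
    table = basic_walk_step(table)
print(table)
```

Only the four chosen corners change, so checking those four entries is enough to decide whether the move is allowed.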

Since each of the possible moves preserves the row and column sums of the matrix (adding and
subtracting 1), this walk always keeps us in the set $\Sigma_{rc}$. We now show that these steps allow us to
move between any two elements in $\Sigma_{rc}$.

Lemma 5.3 The walk whose steps are generated by the procedure above is irreducible on $\Sigma_{rc}$.
Moreover, if $(i, j)$ is the lexicographically first coordinate in which $A$ and $B$ differ, then there is a path
between $A$ and $B$ that does not alter coordinates lexicographically preceding $(i, j)$.




Proof: We need to show that for each $X$ and $Y$ in $\Sigma_{rc}$, there is a path between $X$ and $Y$, using
only possible steps of the walk. Note that since each step is reversible, a path from $X$ to $Y$ implies
a symmetric one from $Y$ to $X$.

We prove that there is a path joining $X$ and $Y$ by induction on a distance measure between the
two tables $X$ and $Y$. Define the distance $d(X, Y)$ between $X$ and $Y$ as $d(X, Y) = \sum_{i,j} |X_{ij} - Y_{ij}|$.
Observe that $d(X, Y) = 0$ if and only if $X = Y$. Further note that, since the grand sums in the
table are the same, this distance is always a multiple of two.

Let $k$ be a nonnegative integer, and assume the following induction hypothesis: if $0 < d(X, Y) \le 2k$
and if $(i, j)$ is the lexicographically first coordinate in which $X$ and $Y$ differ, then there is a
path joining $X$ and $Y$ using only steps of the walk that do not involve coordinates lexicographically
preceding $(i, j)$. This is vacuously true for $k = 0$.

For the induction step, let $X$ and $Y$ be two elements of $\Sigma_{rc}$ and suppose $d(X, Y) = 2(k + 1)$,
where $(i, j)$ is the first coordinate in which they differ. We will show that either there is a move
from $X$ to $X'$ where $d(X', Y) \le 2k$ or there is a move from $Y$ to $Y'$ where $d(X, Y') \le 2k$, where no
coordinates preceding $(i, j)$ are involved. The induction hypothesis will then imply that there is an
entire path between $X$ and $Y$.

In step 2 of the algorithm choose $i_1 = i$ and $j_1 = j$. Then there are two cases to consider.

Case (a) $X_{i_1 j_1} < Y_{i_1 j_1}$. Then since each row and column of $X$ has the same sum as in $Y$, we
have

$$\exists j_2 \text{ such that } X_{i_1 j_2} > Y_{i_1 j_2},$$

$$\exists i_2 \text{ such that } X_{i_2 j_1} > Y_{i_2 j_1}.$$

We must have $i_1 < i_2$ and $j_1 < j_2$, since $(i_1, j_1)$ was chosen as the lexicographically first position
in which $X$ and $Y$ differ. Moreover, the entries $X_{i_1 j_2}$ and $X_{i_2 j_1}$ are both positive, since they are
greater than their nonnegative counterparts in $Y$. This means that, letting $d = +1$, the move

$$X'_{i_1 j_1} = X_{i_1 j_1} + 1 \qquad X'_{i_1 j_2} = X_{i_1 j_2} - 1$$
$$X'_{i_2 j_1} = X_{i_2 j_1} - 1 \qquad X'_{i_2 j_2} = X_{i_2 j_2} + 1$$

yields an $X'$ having nonnegative entries as well as sharing the same row and column sums as $X$.

By moving from $X$ to $X'$, the difference with respect to $Y$ on at least the three coordinates $(i_1, j_1)$,
$(i_1, j_2)$, and $(i_2, j_1)$ decreased by 1. The difference at $(i_2, j_2)$ may have increased by 1, but the net
change in all four coordinates must in any case be a decrease of at least 2. That is, $d(X', Y) \le
d(X, Y) - 2$, so $d(X', Y) \le 2k$. Now by the induction hypothesis there is a path from $X'$ to $Y$.
Adding the step from $X$ to $X'$ completes the path from $X$ to $Y$, without altering any coordinates
lexicographically preceding $(i, j)$ (in which $X$ and $Y$ already agree).

Case (b) $X_{i_1 j_1} > Y_{i_1 j_1}$. This case is entirely symmetric. Swapping the roles of $X$ and $Y$, the
same argument as in case (a) shows that there is a move from $Y$ to $Y'$ with $d(X, Y') \le 2k$. Thus,
by the induction hypothesis, there is a path between $X$ and $Y'$, and hence a path between $X$ and
$Y$ via $Y'$. □
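
The induction in this proof is constructive, and it can be followed mechanically. The sketch below (our own illustration, with hypothetical function names) repeatedly finds the lexicographically first coordinate where two tables with equal margins differ and applies the move of case (a) to whichever table has the smaller entry there, which is exactly case (b) when the roles are swapped. It returns the number of moves used.

```python
def first_difference(X, Y):
    """Lexicographically first (i, j) with X[i][j] != Y[i][j], or None."""
    for i, (row_x, row_y) in enumerate(zip(X, Y)):
        for j, (x, y) in enumerate(zip(row_x, row_y)):
            if x != y:
                return i, j
    return None

def connect(X, Y):
    """Join two tables with equal margins by unit moves; return the move count."""
    X = [row[:] for row in X]
    Y = [row[:] for row in Y]
    moves = 0
    while (pos := first_difference(X, Y)) is not None:
        i1, j1 = pos
        # 'lo' is the table whose entry at (i1, j1) is smaller; it gets raised.
        lo, hi = (X, Y) if X[i1][j1] < Y[i1][j1] else (Y, X)
        # Equal row and column sums guarantee that both searches succeed.
        j2 = next(j for j in range(j1 + 1, len(lo[0])) if lo[i1][j] > hi[i1][j])
        i2 = next(i for i in range(i1 + 1, len(lo)) if lo[i][j1] > hi[i][j1])
        lo[i1][j1] += 1
        lo[i1][j2] -= 1
        lo[i2][j1] -= 1
        lo[i2][j2] += 1
        moves += 1
    return moves

print(connect([[2, 0], [0, 2]], [[0, 2], [2, 0]]))
```

Each move lowers the distance by at least 2, so the loop uses at most half the initial distance in moves.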

Combining  the  above  lemma  with  the  fact  that  the  Markov  chain  is  symmetric  yields  the  following 
desirable  result. 

Theorem 5.4 The walk whose steps are generated by the procedure of Algorithm 5.2 is ergodic and
has uniform stationary distribution on $\Sigma_{rc}$.

Proof: If $P_{XY}$ is the probability of moving from $X$ to $Y$ in one step, then $P_{XY} = P_{YX}$, since if
we are at $Y$, then to move to $X$ we need to pick the same rows and columns as when moving from
$X$ to $Y$, but pick $d$ with the opposite sign. Thus the walk is a symmetric Markov chain, and its
unique stationary distribution will therefore be uniform, provided that the walk is in fact ergodic.
(See Chapter 1.)

In order to show that the walk is ergodic, we need to show it is irreducible and aperiodic.
We showed the walk was irreducible in Lemma 5.3. The walk cannot be periodic, since there are
necessarily some states in which there is a possibility of immediate return. To see this notice that
starting at any table and making any given move repeatedly, one can proceed at most $N$ times before
one of the two entries from which we are repeatedly subtracting 1 reaches zero. From such a state,
choosing the same move will cause us to remain at the same position for that step, an immediate
return. Hence that state is aperiodic. Since the chain is irreducible, the aperiodicity of one state
implies that every state is aperiodic. □

From  this  and  the  Basic  Convergence  Theorem  (1.1),  we  can  conclude  that  if  we  run  the  walk 
long  enough,  we  can  certainly  use  it  to  obtain  uniform  samples. 
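
Putting the pieces together, the significance under the uniform model can be estimated by running the walk and counting how often the sampled table has a smaller chi-squared statistic than the observed one. The following is only a sketch: the number of steps between samples (`spacing`) and the number of samples are arbitrary placeholders, since no bound on the convergence rate is proved here, and the function names are ours.

```python
import random

def chi_squared(table):
    """Pearson chi-squared statistic of a table with positive margins."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    total = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / total) ** 2
               / (rows[i] * cols[j] / total)
               for i in range(len(rows)) for j in range(len(cols)))

def walk_step(X, rng):
    """One step of the basic walk of Algorithm 5.2."""
    i1, i2 = sorted(rng.sample(range(len(X)), 2))
    j1, j2 = sorted(rng.sample(range(len(X[0])), 2))
    d = rng.choice((1, -1))
    Y = [row[:] for row in X]
    Y[i1][j1] += d
    Y[i1][j2] -= d
    Y[i2][j1] -= d
    Y[i2][j2] += d
    ok = min(Y[i1][j1], Y[i1][j2], Y[i2][j1], Y[i2][j2]) >= 0
    return Y if ok else X

def uniform_significance(T, samples, spacing, rng):
    """Estimate Pr{chi2(T') < chi2(T)} with T' taken every `spacing` steps."""
    target = chi_squared(T)
    X, below = T, 0
    for _ in range(samples):
        for _ in range(spacing):
            X = walk_step(X, rng)
        below += chi_squared(X) < target
    return below / samples

observed = [[621, 3, 16], [84, 25, 3], [64, 37, 9]]
print(uniform_significance(observed, samples=50, spacing=200,
                           rng=random.Random(0)))
```

Starting the walk at the observed table means the early samples are correlated with it, so in practice one would want a much larger spacing than this toy value.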


5.2  Bigger  Steps 

The basic walk uses unit step sizes. That is, each step of the walk only adds or subtracts 1 from the
entries it affects. Intuition tells us that allowing moves of step sizes greater than 1 will increase the
rate of convergence. But one must be somewhat careful in designing a walk with larger possible
step sizes. We wish that our walk be time-reversible (so that we can apply our earlier results) and
have uniform stationary distribution. This will be the case if and only if the chain is symmetric.
The choice of the step size should be such that the probability of a single step of the walk from a
table $T_1$ to a table $T_2$ and the step's reversal from $T_2$ to $T_1$ are the same. A naive method may
disrupt the symmetry of the walk and make the stationary distribution non-uniform. For example,
one might initially think of the following strategy: At each step choose a random step size between
1 and the minimum entry on the four corners of the move. This walk will remain aperiodic and
irreducible, but it will no longer have symmetry. The probability of a step and its reversal will not
always be the same.
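
A two-by-two example makes the loss of symmetry concrete. In the sketch below (our own illustration) a $2 \times 2$ table has only one choice of rows and columns, so the probability of a move is one half for the sign times the probability of the step size. The sum-based rule is included for contrast, with a hypothetical divisor of 4 so that more than one step size is available on such small tables.

```python
from fractions import Fraction

def signed_step(X, Y):
    """Signed step size taking 2x2 table X to Y, or None if Y is not one move away."""
    delta = Y[0][0] - X[0][0]
    moved = [[X[0][0] + delta, X[0][1] - delta],
             [X[1][0] - delta, X[1][1] + delta]]
    return delta if delta != 0 and moved == Y else None

def naive_probability(X, Y):
    """Step size uniform on 1..(minimum corner entry of the current table)."""
    delta = signed_step(X, Y)
    largest = min(min(row) for row in X)
    if delta is None or abs(delta) > largest:
        return Fraction(0)
    return Fraction(1, 2) * Fraction(1, largest)

def sum_rule_probability(X, Y, divisor):
    """Step size uniform on 1..ceil(corner sum / divisor); the sum is invariant."""
    delta = signed_step(X, Y)
    a = -(-sum(map(sum, X)) // divisor)
    if delta is None or abs(delta) > a or min(min(row) for row in Y) < 0:
        return Fraction(0)
    return Fraction(1, 2) * Fraction(1, a)

T1, T2 = [[4, 2], [2, 4]], [[5, 1], [1, 5]]
print(naive_probability(T1, T2), naive_probability(T2, T1))              # 1/4 1/2
print(sum_rule_probability(T1, T2, 4), sum_rule_probability(T2, T1, 4))  # 1/6 1/6
```

Under the naive rule the range of step sizes depends on the current table, so the forward and reverse probabilities differ; a rule based on a quantity the move leaves unchanged avoids this.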

One can legitimately choose the step size based on the sum (or average) of the entries on the
four chosen corners of the move; this sum remains invariant under the move, and this guarantees
that the move and its reversal obey the required symmetry. In our experiments, we actually use the
following walk with random step sizes.

Algorithm 5.5 [Random walk with larger step sizes]. Procedure to take a single step of the walk
on $\Sigma_{rc}$ using random step sizes.

1. Let $X \in \Sigma_{rc}$ be the current position of the walk.

2. Choose a pair of rows $i_1$ and $i_2$, with $1 \le i_1 < i_2 \le m$, uniformly from among all $\binom{m}{2}$ such
choices.

3. Choose a pair of columns $j_1$ and $j_2$, with $1 \le j_1 < j_2 \le n$, uniformly from among all $\binom{n}{2}$ such
choices.

4. Choose $d$ uniformly from the set $\{+1, -1\}$.

5. Let $a = \lceil (X_{i_1 j_1} + X_{i_1 j_2} + X_{i_2 j_1} + X_{i_2 j_2})/32 \rceil$, and choose $s$ uniformly from $\{1, 2, \ldots, a\}$.

6. Let $Y$ be the $m \times n$ matrix obtained from $X$ by taking

$$Y_{i_1 j_1} = X_{i_1 j_1} + ds \qquad Y_{i_1 j_2} = X_{i_1 j_2} - ds$$
$$Y_{i_2 j_1} = X_{i_2 j_1} - ds \qquad Y_{i_2 j_2} = X_{i_2 j_2} + ds$$

and $Y_{ij} = X_{ij}$ for all other $(i, j) \in [m] \times [n]$. Notice that $Y$ has the same row and column
sums as $X$, and is a matrix of integers.

7. (Take the Step) If all entries of $Y$ are nonnegative, then move to $Y$; our new current position
is $Y$. Otherwise remain at $X$, i.e. our new current position remains $X$. Note: The entries of
$Y$ will in fact be nonnegative, unless the entry of $X$ is less than $s$ in at least one of the two
positions from which we are subtracting $s$.
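
A step of this walk can be sketched in the same way as the basic one. The code below is an illustration under two assumptions that the scanned text does not settle: the bracket in step 5 is read as a ceiling, and the step size $s$ is drawn uniformly from $\{1, \ldots, a\}$. The function name and the `divisor` parameter (32 in the text) are ours.

```python
import random

def large_step_walk_step(X, rng=random, divisor=32):
    """Take one step of the walk with random step sizes and return the new table."""
    m, n = len(X), len(X[0])
    i1, i2 = sorted(rng.sample(range(m), 2))   # step 2
    j1, j2 = sorted(rng.sample(range(n), 2))   # step 3
    d = rng.choice((1, -1))                    # step 4
    corner_sum = X[i1][j1] + X[i1][j2] + X[i2][j1] + X[i2][j2]
    a = -(-corner_sum // divisor)              # step 5: assumed ceiling
    if a == 0:
        return X                               # all four corners are zero
    s = rng.randint(1, a)                      # assumed range 1..a, inclusive
    Y = [row[:] for row in X]                  # step 6
    Y[i1][j1] += d * s
    Y[i1][j2] -= d * s
    Y[i2][j1] -= d * s
    Y[i2][j2] += d * s
    corners = (Y[i1][j1], Y[i1][j2], Y[i2][j1], Y[i2][j2])
    return Y if min(corners) >= 0 else X       # step 7

table = [[621, 3, 16], [84, 25, 3], [64, 37, 9]]
for _ in range(1000):
    table = large_step_walk_step(table)
print(table)
```

Because the corner sum is the same before and after a move, the value of `a` seen on the way back equals the one seen on the way out, which is what keeps the chain symmetric.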

An argument similar to the one in Theorem 5.4 shows that the modified algorithm also gives an
ergodic walk with uniform stationary distribution on $\Sigma_{rc}$.

The constant 32 that is used in step 5 is somewhat arbitrary. We chose it by experimentation
from among a small set of values tested in walks on some typical sets $\Sigma_{rc}$ that we wished to study;
the value 32 was small enough that in step 5 we did not simply get $a = 1$ most of the time, and
32 was large enough that a high ratio of moves attempted in step 7 were actually taken. In general,
these properties depend on the range of values appearing in the tables.




5.3  Eigenvalue  Hypothesis 


We are currently unable to prove general polynomial upper bounds on the convergence rate of the
walk, so we are not in a position to say precisely how long we must run the walk to obtain good
samples. We can prove bounds in some special cases and for some analogous walks, and we discuss
these later. In this section, we motivate a hypothesis about the second absolute eigenvalue of the
walk on $\Sigma_{rc}$. Under the hypothesis we can calculate bounds on the convergence rate and the number
of samples needed to obtain a given variance in our sample means. We verify empirically that these
bounds seem to be valid, though not necessarily tight.

For any table $T \in \Sigma_{rc}$, knowing the $d = (m-1)(n-1)$ entries $T_{ij}$ with $i \ne 1$ and $j \ne 1$ allows
us to determine the remaining entries by the sum constraints. We can thus specify the table $T$ as
a point of the integer lattice in $d$-dimensional space, and the set $\Sigma_{rc}$ can be associated to the set
of lattice points in the $d$-dimensional convex polytope given by the sum constraints. Each entry
$T_{ij}$ must satisfy $0 \le T_{ij} \le \min(r_i, c_j)$, so it follows that the entire convex region sits within the
corresponding $d$-dimensional “bounding box”. Our walk(s) on $\Sigma_{rc}$ walk on these lattice points,
where the set of possible transitions is a superset of the set of lattice edges. (A lattice edge joins
two integer lattice points whose distance is 1.)
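The correspondence between a table and its $d = (m-1)(n-1)$ free entries can be made concrete with a short sketch. The helper name `table_from_free_entries` is hypothetical, and indices are 0-based, so the free entries are those outside row 0 and column 0.

```python
def table_from_free_entries(free, r, c):
    """Rebuild a full table from its (m-1) x (n-1) free entries and the margins.

    free[i-1][j-1] holds T[i][j] for i >= 1 and j >= 1 (0-based indices).
    r and c are the row sums and column sums.
    """
    m, n = len(r), len(c)
    T = [[0] * n for _ in range(m)]
    for i in range(1, m):
        for j in range(1, n):
            T[i][j] = free[i - 1][j - 1]
        T[i][0] = r[i] - sum(T[i][1:])          # row-sum constraint fixes column 0
    for j in range(n):
        # column-sum constraint fixes the first row
        T[0][j] = c[j] - sum(T[i][j] for i in range(1, m))
    return T
```

A choice of free entries corresponds to a table in $\Sigma_{rc}$ only if every reconstructed entry is nonnegative; the sketch does not check this.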


Conjecture 5.6 (Eigenvalue Hypothesis) For the walk on $\Sigma_{rc}$ described in Algorithm 5.5, the
second absolute eigenvalue $\lambda_*$ satisfies

$$\frac{1}{1 - \lambda_*} \le \frac{1}{\pi^2} \sum_{i,j} \min(r_i, c_j)^2.$$


Our conjecture is motivated by two intuitive lines of thought. First, one expects that subgraphs
of the lattice grid that are delimited by convex polytopes will have conductance (see Chapter 1)
matching that of the grid. The walk on the $d$-dimensional lattice 'box' whose sides are
$\min(r_i, c_j)$ ($1 \le i < m$, $1 \le j < n$) would satisfy the conjecture. For example, in the case of a $2 \times 2$
table with row sums $r_1, r_2$ and column sums $c_1, c_2$, the basic walk of Algorithm 5.2 may be viewed
as the natural random walk on a line segment of length $\ell = \min\{r_1, c_1\} - \max\{0, r_1 - c_2, c_1 - r_2\}$.

Our  second  motivation  comes  from  an  analogy  with  the  limiting,  continuous  case.  Payne  and 
Weinberger  [PW60]  proved  the  following  theorem. 


Theorem 5.7 (Payne-Weinberger) Let $D$ be a convex body with diameter $a$ in $\mathbb{R}^n$. Then

$$\inf_f \frac{\int_D |\nabla f|^2 \, dD}{\int_D f^2 \, dD} \ge \frac{\pi^2}{a^2} \qquad (5.1)$$

where the infimum is over all functions $f$ with bounded second derivative on $D$, and satisfying
$\int_D f \, dD = 0$, and here $\nabla f$ denotes the real gradient operator and $|\cdot|$ is the $\ell^2$ norm on $\mathbb{R}^n$.

The theorem applies to an arbitrary convex body. When it so happens that the boundary of $D$
is sufficiently smooth, the infimum in 5.1 gives the minimax characterisation of the eigenvalue of
the Laplacian operator $\nabla^2 = \sum_{i=1}^{n} \partial^2 / \partial x_i^2$ that is smallest over all functions whose gradient field exhibits
no flow across the boundary of $D$. (Commonly called a Neumann condition, this is expressed
by requiring $\partial f / \partial n = 0$ everywhere on the boundary of $D$, where at each point on the boundary $n$
represents the outward normal vector.) However, to avoid smoothness requirements, one may work
strictly with 5.1, which holds for all functions satisfying the stated conditions, whether or not this
infimum corresponds to an eigenvalue of $\nabla^2$ on $D$.

The reader will note the similarity of 5.1 to 1.8. It is a direct analogue. Indeed, the Q-matrix
of a finely-spaced lattice graph (mesh) within a body $D$ is a commonly used difference approximation
to the Laplacian operator on $D$ (see [BL84]). Fine meshes correspond, by scaling, directly to
the case when $N$ is large. It is, therefore, reasonable to believe that an analogous bound holds for
lattice graphs embedded within large convex bodies. In Section 5.6 we describe an approximation
conjecture that would imply the eigenvalue bound.

Dyer, Frieze, and Kannan [DFK89] proved a bound on the eigenvalues of walks on lattice graphs
within certain convex sets. Their bound requires that the region be “well-rounded” (in a certain
precise sense) and is somewhat weaker, but is still useful to us. We discuss this in Section 5.5.

Remark: The author tried unsuccessfully to show a specific eigenvalue bound for the basic walk
on $\Sigma_{rc}$, using the 'canonical path' methods presented in Chapter 1 on the paths described in
Theorem 5.4. □

5.4  Experimental  Results 

The results in this section come from experiments using computer simulations of the random walk. The
reader should find that they lend supportive evidence to the eigenvalue hypothesis, and demonstrate
the practicality of the algorithms suggested. All of our experiments seem to perform at least as well
as we should expect using our bounds and the conjectured eigenvalue bound. We ran several more
experiments than those we present here. However, most of the relevant issues seem to be covered
by this set of examples.

These experiments were all conducted on a DECstation 3100, a desktop workstation that executes
about $(1.2 \pm 0.2) \times 10^7$ VAX-equivalent machine instructions per second. The experiments were run
in a time-sharing environment. This means that the process computing the experiment shares CPU
time with other processes on the system, including other users' tasks. The CPU figures presented
with each experiment are the 'user' CPU seconds actually used by the process. (This excludes CPU
time used by kernel-level tasks initiated by the process.) The effect of time-sharing on the CPU time
figures is negligible. Real time figures are also presented. These are quite dependent upon
the amount of system load incurred by other processes running concurrently, but since in most cases
the experiment was allotted more than 90% of the CPU cycles that elapsed during its execution,
this effect is also small.




Random numbers were generated using a non-linear feedback generator provided by the system.
The period of the generator is very large, approximately $3.4 \times 10^{10}$, and low-order portions of the
generated values are also purported to look random. The time of day was used as a 'random' initial
seed.

The programs for the experiments were coded in 'C'. We wrote code for Mathematica (tm) to
compute Diaconis and Efron's analytical approximations. These may be obtained from the author.

5.4.1  Background 

In this discussion, we assume background material from Chapters 1 and 3. We also refer
to the Kolmogorov-Smirnov distance between two distributions. This is defined and discussed in
Appendix A.

Fix $\Sigma_{rc}$ and an 'observed' table $T \in \Sigma_{rc}$. Let $\lambda_1$ be the second-largest eigenvalue of the Markov
chain whose steps are described by Algorithm 5.5, and let $\tau = 1/(1 - \lambda_1)$.

Let $H = \{T' \mid \chi^2(T') < \chi^2(T)\}$ be the subset of $\Sigma_{rc}$ consisting of tables with smaller $\chi^2$ values.
If $h$ is the indicator function of $H$, then the significance of $T$ is

$$p = \bar{h} = |H| / |\Sigma_{rc}|,$$

the mean value of $h(T')$ when $T'$ is chosen uniformly from $\Sigma_{rc}$. The results of Chapter 3 and
Section 3.3 tell us that for any $\theta > 0$ we can get an estimate $M$ satisfying

$$\Pr\{|M - p| \le \theta\} \ge .96 \qquad (5.2)$$

by letting $M$ be the median of 5 independent sample means, each based on $4(2\tau - 1)p(1 - p)\theta^{-2}$
(which is at most $(2\tau - 1)\theta^{-2}$) adjacent samples drawn from the chain in the stationary distribution.
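The sample-size expression can be evaluated directly, and it can also be inverted to give the error $\theta$ guaranteed by a given number of samples per run. The sketch below is illustrative only; the function names `samples_per_run` and `guaranteed_error` are hypothetical, and it assumes the per-run count reads $4(2\tau - 1)p(1 - p)\theta^{-2}$.

```python
import math


def samples_per_run(tau, p, theta):
    """Adjacent samples per run: 4 (2 tau - 1) p (1 - p) / theta^2, rounded up."""
    return math.ceil(4 * (2 * tau - 1) * p * (1 - p) / theta ** 2)


def guaranteed_error(tau, p, n):
    """Invert the formula: the theta guaranteed by n adjacent samples per run."""
    return math.sqrt(4 * (2 * tau - 1) * p * (1 - p) / n)
```

For instance, with $\tau = 99$, $p = .05388$ and $198000$ samples per run, `guaranteed_error` gives roughly $.014$, and with $\tau = 1148.5$, $p = .76086$ and $2297000$ samples per run it gives roughly $.027$. These agree with the error guarantees quoted for the irregular-margin experiments in Section 5.4.4.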

The value of $\lambda_1$, and hence also $\tau$, is bounded by the eigenvalue hypothesis. To calculate the
number of steps to use to get near stationarity, we multiplied our conjectured bound on $1/(1 - \lambda_*)$ by
$\ln \prod_{i,j} (1 + \min(r_i, c_j))$, which is a bound on $\ln |\Sigma_{rc}|$. (See Theorem 1.8.)
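To make these two quantities concrete, the sketch below evaluates them from the margins alone. It assumes the conjectured bound reads $\frac{1}{\pi^2} \sum_{i,j} \min(r_i, c_j)^2$ with the sum over all pairs $(i, j)$. That reading is an assumption, chosen because it reproduces the values $2\tau = 6584$ and $2000\tau = 4816000$ reported in the experiments below. The function names are hypothetical.

```python
import math


def conjectured_tau(r, c):
    """Assumed form of the conjectured bound: (1/pi^2) * sum of min(r_i, c_j)^2."""
    return sum(min(ri, cj) ** 2 for ri in r for cj in c) / math.pi ** 2


def burn_in_steps(r, c):
    """Conjectured bound times ln prod (1 + min(r_i, c_j)), rounded up."""
    log_size_bound = sum(math.log(1 + min(ri, cj)) for ri in r for cj in c)
    return math.ceil(conjectured_tau(r, c) * log_size_bound)


# Margins of the Pinckney Gag Rule table (Figure 5.8).
r, c = [133, 24, 68], [117, 40, 68]
two_tau = math.ceil(2 * conjectured_tau(r, c))
```

With the formula exactly as stated (including the added 1), `burn_in_steps` comes out slightly larger than the 113,017 steps reported for this table. The reported figure appears to match the product of $\min(r_i, c_j)$ without the added 1, but that is an inference from the arithmetic and not something the text states.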

The randomised experiments we describe here are all based on the following scheme. The random
walk of Algorithm 5.5 is simulated for a number of steps determined by the aforementioned formula
to get near stationarity. Then some multiple of $2\tau$ samples are drawn by taking adjacent states of
the chain. These samples are used in two ways: (a) the precise proportion of samples whose $\chi^2$ value
is less than that of the input table is recorded, and (b) a bin is associated with each interval of width
0.5 of the real line, and the number of samples whose $\chi^2$ values fall in each bin is counted to make
a sample histogram. This is repeated five times. The significance estimate we report is the median
of the sample means recorded in (a), and accuracy bounds are based on (5.2). The histogram we
speak about is obtained by taking the median count in each bin of the five histograms. Computing
times that we report include the time consumed in all five sample runs and the computation of the
median significance value and median histogram.
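The combining step of this scheme (median of the run means, and bin-wise median of the run histograms) can be sketched as follows. This is only an illustration of the bookkeeping. It takes the per-run samples as given, and the names `median_of_means` and `median_histogram` are hypothetical.

```python
from statistics import median


def median_of_means(indicator_runs):
    """indicator_runs: one list per run of 0/1 values h(T') for the sampled tables."""
    return median(sum(run) / len(run) for run in indicator_runs)


def median_histogram(chi2_runs, width=0.5):
    """Bin each run's chi-square values, then take the median count in each bin."""
    histograms = []
    for run in chi2_runs:
        counts = {}
        for x in run:
            b = int(x // width)                 # index of the bin containing x
            counts[b] = counts.get(b, 0) + 1
        histograms.append(counts)
    bins = sorted(set().union(*histograms))     # every bin seen in any run
    return {b: median(h.get(b, 0) for h in histograms) for b in bins}
```

Taking medians rather than averaging across runs means a single badly behaved run cannot move the reported value very far.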




Where exact distributions and counts were obtained, the underlying algorithm is the exhaustive
enumeration method described in [PTH81]. (They also describe a non-exhaustive method for
determining exact multinomial significance.) The exhaustive method was used similarly to produce two
values: (a) an exact significance value, being the proportion of tables whose $\chi^2$ value was smaller
than that of the input table, and (b) an exact histogram, which counts the number of tables whose
$\chi^2$ values fall in each bin of width 0.5.

The reader should note that the histograms are subject to a discretisation error introduced by
taking bins of width 0.5, and this will have an effect on the Kolmogorov-Smirnov distances we report.
The significance values, both exact and estimated, on the other hand are based on specific counts of
the $\chi^2$ values less than that of the input table, and are not subject to the same errors. The inherent
error in these values due to limited-precision floating-point arithmetic on the computer should be
negligible.
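For small margins, the exact significance can be computed by brute force. The sketch below is a naive enumeration written for illustration. It is not the method of [PTH81], and the function names are hypothetical. It assumes the usual Pearson statistic $\chi^2(T) = \sum_{i,j} (T_{ij} - r_i c_j / N)^2 / (r_i c_j / N)$ and strictly positive margins.

```python
def chi_square(T):
    """Pearson chi-square statistic of a table with positive margins."""
    r = [sum(row) for row in T]
    c = [sum(col) for col in zip(*T)]
    N = sum(r)
    return sum((T[i][j] - r[i] * c[j] / N) ** 2 / (r[i] * c[j] / N)
               for i in range(len(r)) for j in range(len(c)))


def tables(r, c):
    """Yield every nonnegative integer table with row sums r and column sums c."""
    if len(r) == 1:
        if sum(c) == r[0]:
            yield [list(c)]                     # the last row is forced
        return

    def first_rows(j, left, row):
        # Fill the first row column by column without exceeding any column sum.
        if j == len(c) - 1:
            if left <= c[j]:
                yield row + [left]
            return
        for v in range(min(left, c[j]) + 1):
            yield from first_rows(j + 1, left - v, row + [v])

    for row in first_rows(0, r[0], []):
        remaining = [cj - v for cj, v in zip(c, row)]
        for rest in tables(r[1:], remaining):
            yield [row] + rest


def exact_significance(T):
    """Proportion of tables with the margins of T whose chi-square is smaller."""
    r = [sum(row) for row in T]
    c = [sum(col) for col in zip(*T)]
    x0 = chi_square(T)
    total = smaller = 0
    for U in tables(r, c):
        total += 1
        smaller += chi_square(U) < x0           # floating-point comparison
    return smaller / total
```

The comparison is made in floating point, so tables whose statistic ties with the observed one exactly in theory could be misclassified by rounding. A careful implementation would compare exact rationals.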

5.4.2  Pinckney  Gag  Rule 

The table in Figure 5.8 was taken from [BFH75, p. 99]. It records a vote taken in 1836 in the U.S.
House of Representatives on the “Pinckney Gag Rule”, which would have banned certain anti-slavery
petitions. States were classified as northern, border, and southern states. By looking at this vote,
can one detect the presence of a North-South bias on this issue roughly 25 years before the Civil
War?


            Yea   Abstain   Nay    r_i
North        61      12      60    133
Border       17       6       1     24
South        39      22       7     68
c_j         117      40      68    N = 225

Figure 5.8: 1836 Vote on 'Pinckney Gag Rule'

There is an apparent percentage-wise diagonal trend running from South-Yea to North-Nay. The
$\chi^2$ statistic is invariant under permutations of the rows and columns, thus one cannot hope to use it
to validate the observed diagonal trend. However, one can test whether the amount of 'imbalance'
in the table is likely under models in which the rows and columns are independent.

The observed table has $\chi^2(T) \approx 41.08$, which places it higher than the 95th percentile of the $\chi^2$
distribution on 4 degrees of freedom. Its significance in the standard $\chi^2$ test is therefore greater
than .95.

The conditional volume test, however, accords it substantially less significance. An exhaustive
enumeration of the tables is feasible here, and was computed in 73.7 user CPU seconds and 75
seconds of real time. There were 545025 tables with the same row and column sums as the observed
table. The observed table has significance .2959. That is, a little less than 30% of the tables with
the same row and column sums have $\chi^2$ values no larger than that of the observed table.

We used the random walk method with 113,017 steps for each re-randomisation and $2\tau = 6584$
samples for each of the five contributing histograms. The entire process took 28.2 user CPU seconds
and 34 seconds of real time, and yielded the median histogram of Figure 5.9.




Figure 5.9: Exact (dark) and sample (lighter) distributions of $\chi^2$ under the uniform distribution
on $\Sigma_{rc}$ with margins corresponding to the 'Pinckney Gag Rule' table shown in Figure 5.8. The
densities were plotted by joining the histogram data points. Both histograms are normalised here
so that each has total mass 100000. A difference of 200 in the y-coordinate marks a difference of
only .002 in density. The Kolmogorov-Smirnov distance between the two histograms is 0.021. The table
has $\chi^2 = 41.08$, which falls at the 29.6% significance level in the exact distribution and 29.8% in the
sample distribution.

Our  bounds  give  essentially  no  usable  guarantee  on  the  accuracy  of  the  histogram.  They  tell  us 
the  significance  will  be  within  1  of  the  true  significance  with  probability  at  least  31/32.  But  the 
significance  takes  on  values  between  0  and  1,  so  this  bound  is  trivial. 

In actuality, however, the Kolmogorov-Smirnov distance between the two histograms in Figure 5.9
is only 0.021. This means the difference in the significance based on the exact and sample histograms
is at most about 0.02 for any table $T$. In fact, we find that our estimate for the observed table $T$ is .2982,
which agrees to two significant figures with the exact significance.
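The distance between two binned histograms can be computed as sketched below. The definition used in this thesis is given in Appendix A. The sketch assumes the usual one, namely the largest absolute difference between the two cumulative distributions, evaluated at bin boundaries. The function name `ks_distance` is hypothetical.

```python
def ks_distance(hist_a, hist_b):
    """Largest absolute gap between the cumulative distributions of two histograms.

    hist_a, hist_b: dicts mapping bin index -> count; totals need not agree,
    since each histogram is normalised by its own total mass.
    """
    total_a, total_b = sum(hist_a.values()), sum(hist_b.values())
    cum_a = cum_b = 0.0
    worst = 0.0
    for b in sorted(set(hist_a) | set(hist_b)):
        cum_a += hist_a.get(b, 0) / total_a
        cum_b += hist_b.get(b, 0) / total_b
        worst = max(worst, abs(cum_a - cum_b))
    return worst
```

Since a significance value is a cumulative proportion, a small distance of this kind bounds the disagreement between the exact and sampled significance at every bin boundary, up to the discretisation error of the bins.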

In order to have guaranteed performance this good, namely an error of at most .01 under the
hypothesised eigenvalue bound knowing that the true value of $p = .30$, we would expect to require
$8400 \approx 4 \cdot p(1 - p) \cdot (1/.01)^2$ times as many samples in each run. In this situation, our bounds
can only show that four runs of 55,305,600 samples each would be sufficient to guarantee that the
estimate is within .01 of the true value with probability greater than .96. This would take about 3
days of real time on the same computer, so it would be far better to calculate the distribution exactly
for this case. The performance we get here, however, seems to be far better than anything we have
proved, even under the hypothesised eigenvalue bound.

5.4.3  Hair  Color  v.  Eye  Color 

The table shown in Figure 5.10 is discussed in [DE85]. It records the hair color and eye color of 592
subjects. One expects a significant correlation. The $\chi^2$ value of the table is 138.29, which places its
multinomial significance higher than .90.


                         Hair Color
Eye Color   Black   Brunette   Red   Blond    r_i
Brown         68      119       26      7     220
Blue          20       84       17     94     215
Hazel         15       54       14     10      93
Green          5       29       14     16      64
c_j          108      286       71    127     N = 592

Figure  5.10:  Hair  color  and  eye  color  of  592  subjects 

Using an approximation formula in [DE85] (see Appendix C, Section C.1.3), we estimate that
the number of tables with matching margins is greater than $10^{15}$. This is too large to enumerate
exhaustively, so an exact distribution is not available for comparison. Diaconis and Efron [DE85]
obtain an analytic estimate of .41. They know this to be an overestimate. They reduce this to .25
by estimating the effect of what they call 'protrusion'. Using the same formula, we re-calculated
their protrusion estimates, and found the reported protrusion estimates to be in error. Using our
revised protrusion estimate, the table has significance between .16 and .41. We believe that .16 is
pretty close to the truth.

Diaconis and Efron also believed their .25 figure to be an over-estimate. To try to get a handle on
the actual value, they used a different randomised algorithm which gives an estimate of .09. Their
method produces tables from $\Sigma_{rc}$ according to the Fisher-Yates (hypergeometric) distribution, with
an analytic approximate correction to the uniform. They were able to run their algorithm for smaller
values of $N$ ($N = 40$, $N = 60$, and $N = 80$), scaling the table entries proportionally. Then they
extrapolated to the actual value of $N = 592$. In our section on scaling, below, we discuss why we
believe this was an under-estimate.




Our new method estimates the significance at .15. The median histogram is shown in Figure 5.11.
This was computed using $T = 1799772$ steps to randomise and $2\tau = 46,148$ samples in each of the five
contributing histograms. The entire process took 457.16 user CPU seconds and ran in just under 8
minutes of real time. Assuming that the true value is indeed close to .15, our bounds would require
5100 times as many samples, or about $2.35 \times 10^8$ samples per run, in order to guarantee an error of
at most 0.01 with probability greater than .96. This would take an estimated 28 days on the same
computer (though achieving the accuracy 0.01 with confidence about .75 would only take about 5.6
days). Here are two points for comparison: (1) the statistical studies from which such tables arise
often take several months; (2) an exhaustive enumeration on the same computer at an estimated
rate of 500,000 tables per minute would take about 3805 years. A factor of 5 to 10 speedup in
the computational power of desktop computing over the next decade would make the random walk
method much more reasonable. Any method that relies on enumerating a substantial fraction of the
tables, however, would remain equally unreasonable.
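The two headline figures in this comparison follow from simple arithmetic, which the sketch below reproduces. The function name and the use of a 365-day year are assumptions made for the illustration.

```python
MINUTES_PER_YEAR = 60 * 24 * 365    # assumes a 365-day year


def enumeration_years(num_tables, tables_per_minute):
    """Years needed to enumerate num_tables at the given rate."""
    return num_tables / tables_per_minute / MINUTES_PER_YEAR


# About 10^15 tables at 500,000 tables per minute.
years = enumeration_years(10 ** 15, 500_000)

# 5100 times the 46,148 samples per run used in the experiment.
samples_needed = 5100 * 46148
```

Here `years` comes to a little over 3805 and `samples_needed` is 235,354,800, that is, about $2.35 \times 10^8$.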




Figure 5.11: Sample histogram of $\chi^2$ under the uniform distribution on $\Sigma_{rc}$ with margins as in
Figure 5.10. Total mass is again normalised to 100000. The smoothness and essential completeness
of the curve suggest that the walk sampled a representative selection of the approximately $10^{15}$
tables in $\Sigma_{rc}$.




5.4.4  Irregular  Margins 

One expects the worst performance from the random walk when the row and column sums (margins)
are very irregular. This gives rise to sets $\Sigma_{rc}$ whose convex polyhedral description is not very
well-rounded. We show some results from experiments with such “lumpy” tables. So that we can compare
with exact results, we use small values of $N$ for which exhaustive enumeration is also feasible.


Figure  5.12:  A  table  with  irregular  margins. 

The table of Figure 5.12 is a table with irregular margins. It was constructed artificially, and is
not from an actual study. An exact enumeration took 60.3 user CPU seconds and 1 minute and 2
seconds of real time. There are 252253 tables with the same margins. The $\chi^2$ value for the given
table is 9.14, and its exact significance to five significant figures is 0.05388.

The Diaconis-Efron approximation estimates the significance between 0.02696 (without
protrusion correction) and 0.04405 (with protrusion correction).

We discuss two experiments on this table. In the first, the walk was run for $T = 2954$ steps to
randomise and five runs of $2\tau = 198$ steps each were combined to obtain a median histogram and
median significance estimate. This took only .847 user CPU seconds and 2 seconds of real time and
gave a significance estimate of .0455. This is within .0016 of the true significance. However, the
Kolmogorov-Smirnov distance between the two histograms was much larger, 0.17. This means that
significance estimates for some other tables in $\Sigma_{rc}$ would have been worse. However, the result is
well within our bounds for this amount of sampling, which say only that the significance estimate
should have been within .44 of the true significance with high probability.

In the second experiment, we used the same number of steps, $T$, to randomise, but now made
five runs of $2000\tau = 198000$ adjacent samples each. According to our bounds, under the eigenvalue
hypothesis, this would guarantee that the error in the significance estimate is at most .014 with
probability greater than .96. In actuality, we did much better. Our estimate from this experiment
to five significant figures is .05388, agreeing with the true value in all these digits. So the error here
is less than .000005. Moreover, the Kolmogorov-Smirnov distance between the sample and exact
histograms was less than .002.

The time required for this second experiment was 120 seconds of user CPU time, and 2 minutes
37 seconds of real time, or roughly twice the time required for the exact enumeration. The exact
and sample histograms are shown in Figure 5.13. It should be noted that the number of samples
being computed in each of the five runs was a significant fraction of the number of tables. This
makes the accuracy obtained less surprising. However, the following is to be noted, here and in the
other experiments: if the walk had a convergence rate much worse than our conjectured rate, we
might expect to “get stuck” in some small region of the set $\Sigma_{rc}$. Instead, our sampling seems to get
a representative portion of $\Sigma_{rc}$, as we would expect from near-uniform sampling.




Figure 5.13: Histograms in the second (larger-sample) experiment on the table in Figure 5.12. The
sample histogram is depicted alone in the upper plot, and underlying the exact histogram (dark)
in the lower plot. Each histogram is normalised to have total mass 100000. The Kolmogorov-Smirnov
distance between the two histograms is less than 0.002. The significance estimate for the observed
table is off by less than 0.000005.




                        r_i
         5     2     3   10
        50     7     5   62
         3     6     4   13
         5     3     3   11
         2     7    30   39
c_j     65    25    45   N = 135

Figure  5.14:  Another  table  with  irregular  margins 

Figure 5.14 shows another table with irregular margins. The exhaustive enumeration of the
corresponding set $\Sigma_{rc}$ took 54665 CPU seconds, and 15 hours and 21 minutes of real time. There
were 239382173 tables in the set. The specified table has $\chi^2 = 72.18$ and its exact significance to five
significant figures is .76086.

The Diaconis-Efron approximation very accurately estimates the number of tables in $\Sigma_{rc}$ as
$2.33 \times 10^8$, but their approximation grossly overestimates the significance region, resulting in a
significance estimate greater than 1, even with their protrusion correction.

We ran our Monte Carlo method, again employing the median of sample mean estimates over five
runs. In each run we used $T = 49936$ steps to randomise, and took $2000\tau = 2297000$ adjacent
samples. The entire process took 23 minutes and 31 seconds, or about 1/30th the time of the
exhaustive method.

Our guarantees place our estimate within .027 of the true value with probability greater than .96. The
resulting significance estimate was .7638, which is off by only .003, much less than the guaranteed
error. The histograms are shown in Figure 5.15. The Kolmogorov-Smirnov distance between them is
less than .005. In this case, in each of our five runs, the number of samples was only about 1% of the
size of the sample space. This result is therefore more convincing than that of the preceding experiment.




Figure 5.15: Histograms for the table in Figure 5.14. The sample histogram is depicted alone in the
upper plot, and underlying the exact histogram (dark) in the lower plot. Each histogram is
normalised to have total mass 100000. The Kolmogorov-Smirnov distance between the two histograms is
less than 0.005, and the significance estimate for the observed table is off by only 0.003. The exact
histogram took over 15 hours to compute, while the sample histogram took under 24 minutes.




5.4.5  Scaling 

In the examples above, the large number of steps required to reach stationarity is due essentially to
the presence of large margins, which in turn are due to large sample sizes $N$. One might hope to
get good approximations as follows. Scale the table to have grand sum $N' < N$, by multiplying by
$N'/N$ throughout, rounding as needed. (One hopes that the effect of rounding is small.) Calculate
or estimate the significance for several such $N'$ and extrapolate to the larger $N$.
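The scaling step itself is a one-liner, sketched here with a hypothetical helper `scale_table`. Rounding to the nearest integer is an assumption, since the text says only "rounding as needed", and in general the rounded entries need not sum exactly to $N'$.

```python
import math


def scale_table(T, n_prime):
    """Multiply every entry by n_prime / N and round to the nearest integer."""
    N = sum(map(sum, T))
    return [[math.floor(v * n_prime / N + 0.5) for v in row] for row in T]


# Hair and eye color table of Figure 5.10, scaled from N = 592 to N' = 200.
hair_eye = [[68, 119, 26, 7], [20, 84, 17, 94], [15, 54, 14, 10], [5, 29, 14, 16]]
scaled = scale_table(hair_eye, 200)
```

For this table the rounded entries happen to sum to exactly 200, with row sums 74, 73, 31, 22 and column sums 37, 96, 25, 42.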

The problem with this method is that it is not always easy to extrapolate accurately. Certainly,
for very large $N$, the significance value is not very dependent on $N$. But gauging the passage from
small $N'$ to much larger $N$ seems to be difficult. Diaconis and Efron attempt such an extrapolation
for the Hair-Eye color table of Figure 5.10. Using their Monte Carlo method, they estimate the
significance of the table for $N' = 40$, $N' = 60$, and $N' = 80$, at .036, .051, and .069, respectively.
They extrapolate (they do not say how) to .09 for $N = 592$. We think this is too small.


                         Hair Color
Eye Color   Black   Brunette   Red   Blond    r_i
Brown         23       40        9      2      74
Blue           7       28        6     32      73
Hazel          5       18        5      3      31
Green          2       10        5      5      22
c_j           37       96       25     42     N' = 200

Figure  5.16:  Re-scaled  Hair/Eye  Color  Table. 

We experimented with a table in between. For $N' = 200$, the scaled table appears in Figure 5.16.
The Diaconis-Efron approximation for the number of such tables is on the order of $10^{11}$, so that an
exhaustive significance calculation is not feasible. Their analytic approximation with maximum
protrusion correction estimates the significance at 0.13. Using our method, we estimate the significance
at 0.136. We took $2000\tau = 4816000$ samples in each of five runs, with 133914 steps used to get near
stationarity in each run. Assuming the eigenvalue bound holds, and that the actual significance is
0.13, this would have given us accuracy to within .02 with high probability. We therefore believe
the exact significance is in the range $0.13 \pm .02$. The experiment took 2975 CPU seconds and just
under 53 minutes of real time.

5.5  Provably  Polynomial-Time  Methods 

Under the hypothesised eigenvalue bound, the random walk method we suggest gives a significance
estimate within $\epsilon$ of the true value with probability at least $1 - \beta$, in time polynomial in the dimensions
of the table $m$ and $n$, the grand sum $N$, $1/\epsilon$, and $\lg(1/\beta)$. However, the eigenvalue bound is only
conjectural.

Using results of Dyer, Frieze, and Kannan [DFK89], one can prove that a modified version of our
algorithm converges in polynomial time for a more substantial and interesting subclass of sets $\Sigma_{rc}$.
Their methods can also be applied to get other provably polynomial-time randomised algorithms for
estimating significance under the uniform model (as well as some other problems that can be solved
by sampling, such as approximate counting). In both cases the resulting polynomial bounds do
not give very practical running times. Not only are the degrees of the polynomials substantial, but
the constant factors are also large. In practice, one must, as we did, conjecture faster performance
to match the empirical performance of our suggested method. So the results in this section do not
seem to lead to a better practical solution to the problem.

We  will  not  give  a  detailed  description  of  their  arguments,  but  we  will  show  how  to  apply  their 
ideas.  For  details,  the  reader  will  need  to  refer  to  their  paper.  We  use  some  terminology  we 
introduced  in  Section  5.3,  and  we  need  to  introduce  some  more. 

For any point $x = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$, let cube(x) denote the $d$-dimensional unit cube centered
at $x$,

$$\mathrm{cube}(x) = \{ y \in \mathbb{R}^d \mid \max_i |y_i - x_i| \le 1/2 \}.$$

The integer lattice in $d$ dimensions is the infinite graph obtained by joining every two points
with integer coordinates whose distance is exactly 1. This graph is $2d$-regular.

Let $K$ be any convex set in $\mathbb{R}^d$. Consider, now, the finite induced subgraph of the integer lattice
whose vertices $V$ are those points $v$ with integer coordinates such that cube(v) intersects $K$,

$$V = \{ v \in \mathbb{Z}^d \mid \mathrm{cube}(v) \cap K \ne \emptyset \}.$$

To  any  vertices  v  £  V,  whose  degree  degv  is  less  than  2d,  add  2d  -  deg  v  self-loop  edges.  Call  the 
resulting  graph  Latt(Ff),  the  lattice  graph  of  K.  It  is  2d-regular,  and  is  connected  (by  convexity 
of  Jif).  Note  that  the  vertex  set  V  contains  two  types  of  points.  It  contains  points  in  Zd  that  are 
in  K ,  which  we  call  interior  vertices,  and  it  also  contains  points  that  are  not  in  if,  but  whose 
cubes  intersect  K,  which  we  call  boundary  vertices.  Note  that  if  boundary  vertices  were  excluded, 
it  could  destroy  connectivity. 
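The construction of Latt(K) can be made concrete for a simple convex set. The following is a minimal sketch, assuming K is a Euclidean ball and that cube(v) reaches 1/2 in each coordinate direction; the function names (`cube_meets_ball`, `lattice_graph_of_ball`, `is_interior`) are illustrative and do not come from the thesis or from [DFK89].

```python
import itertools
import math

HALF_WIDTH = 0.5  # assumption: cube(v) has side 1, so it reaches 1/2 in each coordinate


def cube_meets_ball(v, center, radius):
    """True if the unit cube centered at lattice point v intersects the ball."""
    dist2 = 0.0
    for vi, ci in zip(v, center):
        # distance from ci to the interval [vi - 1/2, vi + 1/2]
        gap = max(abs(ci - vi) - HALF_WIDTH, 0.0)
        dist2 += gap * gap
    return dist2 <= radius * radius


def lattice_graph_of_ball(center, radius):
    """Return Latt(K) for a ball K as a dict: vertex -> list of 2d neighbors.

    A missing lattice neighbor is replaced by the vertex itself (a self-loop),
    so every vertex ends up with exactly 2d entries.
    """
    d = len(center)
    ranges = [
        range(math.floor(c - radius) - 1, math.ceil(c + radius) + 2)
        for c in center
    ]
    vertices = {
        v for v in itertools.product(*ranges) if cube_meets_ball(v, center, radius)
    }
    graph = {}
    for v in vertices:
        neighbors = []
        for axis in range(d):
            for step in (-1, 1):
                w = v[:axis] + (v[axis] + step,) + v[axis + 1:]
                neighbors.append(w if w in vertices else v)
        graph[v] = neighbors
    return graph


def is_interior(v, center, radius):
    """Interior vertices are the lattice points that lie in K itself."""
    return sum((vi - ci) ** 2 for vi, ci in zip(v, center)) <= radius * radius


if __name__ == "__main__":
    center, radius = (0.0, 0.0), 3.0
    g = lattice_graph_of_ball(center, radius)
    interior = [v for v in g if is_interior(v, center, radius)]
    print(len(g), "vertices,", len(interior), "interior")
```

For a general convex K only the intersection test changes; the self-loop rule and the interior/boundary split stay the same.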

Call a set in $\mathbb{R}^d$ large and well-rounded if it contains a ball $B_I$ of radius $\rho_I$ and is contained
in a ball $B_O$ of radius $\rho_O$ such that $\rho_I \ge 20 d^{5/2}$ and $\rho_O/\rho_I \le \sqrt{d}\,(d+1)$.

Theorem 5.17 (from [DFK89]) Let K be a large and well-rounded convex set. Let P be the strongly
aperiodic random walk on Latt(K). Then P is time-reversible, ergodic, and has uniform stationary
distribution on the vertices V, and the second absolute eigenvalue of P satisfies

\[ \lambda_*(P) \le 1 - \frac{1}{4000000\, d^9 \rho_I^4}. \]




Proof: That P is time-reversible, ergodic, and symmetric follows from the fact that P is a random
walk on a connected regular graph. Scale coordinates in all dimensions by $1/\rho_I$. This puts one in the
framework used by [DFK89]: $B_I$ becomes the unit ball, and their value $\delta = 1/\rho_I \le 1/(20 d^{5/2})$. Note
that this is what they call the natural Markov chain, and not the technical Markov chain. Now apply
their Lemma 1. It gives a bound on the edge-expansion of the graph, which we multiply by 1/4d
(the minimum probability on any transition of the strongly aperiodic chain) to get a conductance
bound. Then square and replace $\delta$ with $1/\rho_I$ to obtain this version. ■

Lemma 5.18 (from [DFK89]) If K is a large and well-rounded convex set in $\mathbb{R}^d$, then the ratio of
the number of boundary vertices of Latt(K) to the number of interior vertices of Latt(K) is at most
3/20d.

Proof: This is their Proposition 3, with their value $\eta = \delta \le 1/(20 d^{5/2})$. ■

Now, for a given instance of $\Sigma_{rc}$, let d = (m - 1)(n - 1) and let D be the convex d-dimensional
polytope consisting of the set of all (m - 1) x (n - 1) tables $X = (X_{ij})$ with nonnegative real-valued
entries satisfying

\[ \sum_{1 \le j < n} X_{ij} \le r_i \ \text{ for each } i,\ 1 \le i < m, \qquad \sum_{1 \le i < m} X_{ij} \le c_j \ \text{ for each } j,\ 1 \le j < n. \tag{5.3} \]

The interior vertices of Latt(D) correspond precisely with tables in $\Sigma_{rc}$. The random walk on the
graph Latt(D) is very similar to our basic random walk (Algorithm 5.2). To simulate the walk on
Latt(D), we add 1 or subtract 1 from some entry $T_{ij}$ of the table with $1 \le i < m$ and $1 \le j < n$.
This corresponds to a move of Algorithm 5.2 in which $i_1 = i$, $j_1 = j$, $i_2 = m$, $j_2 = n$, except that
now we allow the table to be 'slightly outside' of $\Sigma_{rc}$. (Changing all entries by at most 1, one
will be able to get to a table within $\Sigma_{rc}$.)

Call an instance of $\Sigma_{rc}$ large and well-rounded if the associated d-dimensional polytope D is
large and well-rounded. The following corollary is almost immediate.

Theorem 5.19 (Near-Uniform Generation in Large Well-Rounded $\Sigma_{rc}$) There is an algorithm
which, given any large and well-rounded instance of $\Sigma_{rc}$, and $\epsilon$, $\delta$ (with $0 < \epsilon < 1$ and
$0 < \delta < 1$), runs in time polynomial in m, n, N, $\lg 1/\epsilon$, and $\lg 1/\delta$. At its termination, with
probability at least $1 - \delta$, the algorithm reports a 'success' and outputs a table in $\Sigma_{rc}$ according to a
distribution that is within ratio $\epsilon$ of the uniform distribution. Otherwise it reports a 'failure' and
outputs no table.

Proof: Use the random walk to draw samples from the vertices of Latt(D), within ratio $\epsilon$ of
uniform. It is clear that we can simulate steps of the walk in time polynomial in m, n, and N.
Theorem 5.17 and Theorem 1.8, together with the facts that we must have $\rho_I \le N$ and $|\Sigma_{rc}| \le N^d$,
imply that $O(N^4 d^9 (d \ln N + \lg(1/\epsilon)))$ steps suffice to get a vertex distributed within ratio $\epsilon$ of
uniform on Latt(D). We reject the sample if it is a boundary vertex. By Lemma 5.18 and the
near-uniformity of the sample, this happens with probability at most $(3/20d)(1 + \epsilon) \le 3/10$. The result
follows by repeating $O(\lg 1/\delta)$ times, stopping and reporting a 'success' if a sample is an interior
vertex, and outputting the corresponding table in $\Sigma_{rc}$. We report 'failure' if none of the $O(\lg 1/\delta)$
samples are interior vertices. ■
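The repetition step of this proof is ordinary rejection sampling: if each draw is rejected with probability at most 3/10, then k independent draws all fail with probability at most $(3/10)^k$. The sketch below computes the number of repetitions and wraps the retry loop. The names `trials_needed`, `sample_interior`, `draw_vertex`, and `is_interior` are hypothetical, and `draw_vertex` stands in for a full run of the random walk, which is not implemented here.

```python
def trials_needed(delta, reject_prob=0.3):
    """Smallest k with reject_prob**k <= delta."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if not 0.0 < reject_prob < 1.0:
        raise ValueError("reject_prob must lie strictly between 0 and 1")
    k, fail = 1, reject_prob
    while fail > delta:
        fail *= reject_prob
        k += 1
    return k


def sample_interior(draw_vertex, is_interior, delta, reject_prob=0.3):
    """Draw up to trials_needed(delta) vertices; return the first interior one.

    Returns None (a 'failure') if every draw lands on a boundary vertex.
    draw_vertex is a placeholder for one near-uniform sample from Latt(D).
    """
    for _ in range(trials_needed(delta, reject_prob)):
        v = draw_vertex()
        if is_interior(v):
            return v
    return None
```

Since the count grows like $\lg(1/\delta)$, the cost of driving the failure probability down is small next to the cost of each individual walk.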

It  follows  that  we  can  use  this  alternative  technique  in  our  suggested  method  to  get  good  estimates 
for  significance  or  any  other  mean-value  quantity. 

Corollary 5.20 For large well-rounded instances of $\Sigma_{rc}$ there is an algorithm that, given any table
$T \in \Sigma_{rc}$, runs in time polynomial in m, n, N, $1/\epsilon$, and $\lg 1/\delta$ and outputs an estimate
M for the volume-based significance p of $\chi^2(T)$ such that $|M - p| \le \epsilon$ with probability at least $1 - \delta$.

Large well-rounded instances of $\Sigma_{rc}$ correspond to instances in which the grand sum is large
and the margins are quite regular. The following example illustrates the most regular case. Other
classes of well-rounded instances can be established in a similar way.

Example 5.21 (Magic Squares) The set $H_{nr}$, magic squares of order n and margins r, is
the set of square arrays in which every row sum and column sum is r. (Some authors reserve the term
'magic squares' for arrays in which the main diagonals also sum to r.) This is a well-studied class of
instances of $\Sigma_{rc}$. (See [Sta86, p. 232] and [GC77].) It is not hard to see that for $r \ge 40(n-1)^7$ these
instances are all well-rounded. Let $d = (n-1)^2$. The d-dimensional polytope D for $H_{nr}$ contains
the  d-dimensional  cube  [^r^,  It  follows  that  it  contains  a  ball  of  radius  at  least  half  the 

width  of  the  cube,  pj  =  Moreover,  D  is  clearly  contained  in  a  ball  of  radius  po  =  rn-1^2.  The 
ratio of these two is $\rho_O/\rho_I \le 2d^{3/4}$, and so it follows that for $r \ge 40 d^{7/2} = 40(n-1)^7$, D is
large and well-rounded. From the algorithm above, we will be able to get nearly uniform random
magic squares in roughly $O(r^4 d^6 \ln r)$ steps of the walk. Once near uniformity, one can obtain 'good'
samples in $O(r^4 d^5)$ steps using the ideas in Chapter 3. □

Remark: More generally, the algorithm described in [DFK89] can be used to estimate significance
in cases that are not well-rounded. The results of [DFK89] can be used to approximate the number
of tables in $\Sigma_{rc}$ and the number of tables in a given significance region as follows. If D is convex
but not well-rounded, there is an affine transformation of $\mathbb{R}^d$ that takes D to a large well-rounded
convex set K, and this transformation is polynomial-time computable [DFK89, GLS88]. However,
the image in K of the integer lattice points in D will form a grid in K having unequal (but regular)
spacings in the various coordinates. Still, one can approximate the number of such lattice points
within K (or a given region of K) by using a finer lattice in which the spacings are equal in all
coordinates. This in turn can be used to approximate the number of lattice points in D. □




5.6  More  on  the  Eigenvalue  Hypothesis 

Our eigenvalue hypothesis was based on Payne and Weinberger's Theorem (5.7) for the Laplacian in
the continuous setting. Because of the similarity of the Rayleigh quotients involved in both cases, it
is natural to believe that the eigenvalues of the Laplacian of the random walk on the lattice within
a (large enough) body K will be approximately those of the body. (Increasing the scale of the body
is equivalent to placing a finer mesh within the original body.)

I  believe  that  the  following  statement  or  one  of  essentially  similar  form  holds  generally  for  lattice 
graphs  within  large  (not  necessarily  well-rounded)  convex  sets. 


Conjecture 5.22 (Lattice Approximation Hypothesis) There exists a (low-degree) polynomial
r(d) and a constant c such that if K is any body in $\mathbb{R}^d$ containing a ball of radius $\rho_I \ge r(d)$, the
following holds for Latt(K) = (V, E).

If $\phi : V \to \mathbb{R}$ is any function with $\sum_{v \in V} \phi^2(v) = 1$ and $\sum_{v \in V} \phi(v) = 0$, then there is a function
f on K with bounded second derivative such that

\[ \int_K f^2 \, dK = 1, \qquad \int_K f \, dK = 0 \]

and

\[ \int_K |\nabla f|^2 \, dK \le c \sum_{e = \{x,y\} \in E} (\phi(x) - \phi(y))^2. \]

If  this  were  true,  one  could  prove  the  following  useful  result. 


Theorem 5.23 Suppose the preceding conjecture holds, and let K be a convex body in $\mathbb{R}^d$ with
diameter $\Delta$. Let G = (V, E) = Latt(K), let P be the natural random walk on G, and let L = I - P
be the associated Laplacian. Then

\[ \lambda_1(L) \ge \frac{\pi^2}{2dc\Delta^2}. \]

Proof: Let $\phi$ be an eigenfunction of L = I - P corresponding to $\lambda_1(L)$, normalized so that
$\sum_v \phi^2(v) = 1$. This satisfies the preconditions of the conjecture above, so let f be a corresponding
function on K guaranteed by the conjecture. This function is admissible in the Payne-Weinberger
lower bound (5.1), which gives us

\[ \sum_{e = \{x,y\} \in E} (\phi(x) - \phi(y))^2 \ge \frac{1}{c} \int_K |\nabla f|^2 \, dK \ge \frac{\pi^2}{c\Delta^2}. \]

Therefore, by (18), we have

\[ \lambda_1(L) = \frac{1}{2d} \sum_{e = \{x,y\} \in E} (\phi(x) - \phi(y))^2 \ge \frac{\pi^2}{2dc\Delta^2}. \]
■


The theorem would imply $O(n\Delta^2 \ln|V|) \subseteq O(n^2\Delta^2 \ln \Delta)$ convergence for walks on connected
lattice graphs within sufficiently large convex bodies. The conjecture could be weakened slightly to
allow c to depend as a small polynomial on the dimension n, with a correspondingly weaker theorem.

Remark: Dyer, Frieze, and Kannan use an isoperimetric inequality combined with Jerrum and
Sinclair's conductance bounds (see Theorem 1.16) and it seems that the resulting squaring gives
them an eigenvalue bound that depends on $\Delta$ essentially as $1/\Delta^4$ for fixed dimension. We know
that if K is a d-dimensional rectangular box the eigenvalue depends on $\Delta$ as $1/\Delta^2$.

Lovász and Simonovits [LS90] give an isoperimetric inequality for similar lattice graphs that
almost proves a bound of this form. Unfortunately, they are forced to neglect sets whose size falls
below a certain value m(D). This arises because of problems with the isoperimetric inequality at
sets of points taken near the boundary. They bound the size of the boundary by m(D) and neglect
them in their inequality. The resulting isoperimetric inequality does not give them provably rapid
mixing to zero variation distance, but to a distance that depends on the maximum total variation
of the initial distribution from the stationary distribution $\pi$ on sets of size m(D). They
show that they can still use this bound in their application.

We are suggesting that it may be better to handle the problem directly as a variational problem,
instead of using an isoperimetric inequality. It may be the case that only a weaker isoperimetric
inequality (such as that of Lovász and Simonovits) holds for these graphs. □
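The $1/\Delta^2$ behavior for a box mentioned in the remark can be checked by hand in the simplest case, d = 1, where Latt(K) for an interval is a path whose two end vertices carry a self-loop. The sketch below is an illustration rather than part of the thesis: it relies on the standard fact that this walk has eigenvectors $\phi_k(j) = \cos(\pi k (j + 1/2)/n)$ with eigenvalues $\cos(\pi k/n)$, verifies the k = 1 case numerically, and compares the spectral gap with $\pi^2/(2n^2)$. All function names are made up for the example.

```python
import math


def reflecting_path_walk(n):
    """Natural walk on the lattice graph of an interval with n lattice points.

    Each vertex moves left or right with probability 1/2; a move off the end
    of the path is replaced by a self-loop, so the graph is 2-regular.
    """
    P = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for nbr in (j - 1, j + 1):
            target = nbr if 0 <= nbr < n else j
            P[j][target] += 0.5
    return P


def cosine_mode(n, k):
    """Candidate eigenvector phi_k(j) = cos(pi * k * (j + 1/2) / n)."""
    return [math.cos(math.pi * k * (j + 0.5) / n) for j in range(n)]


def residual(P, phi, lam):
    """Largest entry of |P phi - lam phi|; near zero for a true eigenpair."""
    n = len(phi)
    return max(
        abs(sum(P[i][j] * phi[j] for j in range(n)) - lam * phi[i])
        for i in range(n)
    )


def spectral_gap(n):
    """Gap 1 - cos(pi / n) between the top two eigenvalues of the path walk."""
    return 1.0 - math.cos(math.pi / n)


if __name__ == "__main__":
    n = 200
    P = reflecting_path_walk(n)
    print("residual:", residual(P, cosine_mode(n, 1), math.cos(math.pi / n)))
    print("gap * n^2:", spectral_gap(n) * n * n, " pi^2 / 2:", math.pi ** 2 / 2)
```

With the interval's diameter $\Delta$ roughly n, the gap behaves like $\pi^2/(2\Delta^2)$, which is the quadratic dependence the remark refers to.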


Chapter  6 

Directions  for  Future  Work 


There  are  several  areas  that  seem  promising  to  investigate  in  the  near  future.  Here  are  some  ideas 
to  consider  that  relate  to  the  work  in  this  thesis. 

1.  Many  of  the  beautiful  convergence  results  are  based  on  isoperimetric  inequalities  that  had 
arisen earlier in the study of the Laplacian on Riemannian manifolds. Much of the framework
used  in  Chapter  1  was  based  on  geometric  intuitions  from  the  continuous  setting.  Almost 
certainly,  there  are  more  connections  to  explore.  A  solid  background  in  analysis  is  probably 
required.  I  felt  that  this  was  a  problem  in  resolving  the  approximation  conjecture  in  the  last 
chapter.  That  result,  as  well  as  others  relating  these  two  branches  of  mathematics  would  be 
extremely  interesting. 

2. In the last couple of years many authors have recognized the remarkable prospects for using
expander graphs to reduce randomness requirements in probabilistic algorithms. In Chapter 4,
we proved some new independence-like properties of correlated samples obtained from
expanders, discussed some properties that others had proved, and showed a new way in which
they could be combined. What further properties do such samples nearly share with truly
independent samples? Good results in this arena would have implications in many probabilistic
algorithms.

3. Most of the tight convergence bounds for Markov chains come from full spectrum bounds,
which in turn come from harmonic analysis. When can walks on graphs be lifted to walks on
groups? This question was raised and partially answered by Diaconis and Shahshahani [DS87].
We discussed some such cases in Chapter 2, with the purpose of showing some general
diameter-based bounds for such walks, even when the harmonic analysis does not seem tractable. These
are nice in the absence of other bounds, but when one can do the harmonic analysis one does
better. Often one can't. So is there anything in between? For example, can one sometimes
use symmetry and knowledge of a particular group to construct a system of paths that gives
a convergence bound that is $o(\Delta^2 \ln|\Gamma|)$? What about coupling methods? These questions
seem hard. The advantage to the interested investigator is that there are some nice examples
that have been treated with tight results, and the background material is well-developed.
See [Dia88].

4. A matroid in S is a family $\mathcal{I}$ of subsets $I \subseteq S$ that is closed under containment (so that $\mathcal{I}$ is an ideal), and
such that the cardinality of every maximal subset I is the same. The Bernoulli-Laplace process
that we studied in Chapter 2 can be viewed as a random walk on the basis-exchange graph
of the trivial matroid whose bases (maximal independent sets) are all h-sets of [n]. Broder's
random walk on matchings [Bro86, JS88] may be viewed as a walk on the top two layers of the
graph of the associated matroid. Are there any general conductance bounds for such walks?
Intuitively, ideals in the hypercube are 'convex' and should not have conductance much worse
than the cube. Mihail and Vazirani [MV88] have posed conjectures about the conductance of
similar graphs. A geometric technique seems promising. This question for the spanning-tree
matroid has still not been answered despite significant efforts by several people. Answers could
be applied to sampling in matroids.

5. To what other problems can the existing eigenvalue bounds be applied? It seems that one
reasonable place to look is in the field of numerical analysis, and at iterative matrix methods
in particular.


Appendix  A 

Notions  of  Approximation 


A.1 Point Approximations

For $a, \hat{a} > 0$ and $r \ge 1$, we say that $\hat{a}$ approximates a to within ratio r if

\[ a/r \le \hat{a} \le a r. \]

Note that we require all of the quantities involved to be positive. This form of measuring error is
essentially equivalent to the usual notion of relative error but more easily handled in combinational
settings. Recall that one says $\hat{a}$ approximates a to within relative error $\epsilon < 1$ if

\[ a(1 - \epsilon) \le \hat{a} \le a(1 + \epsilon). \]

If $\hat{a}$ approximates a within ratio $(1 + \epsilon)$, with $\epsilon < 1$, then $\hat{a}$ approximates a to within relative error $\epsilon$.
Almost conversely, if $\hat{a}$ approximates a within relative error $\epsilon < 1$, then $\hat{a}$ approximates a within
ratio $1/(1 - \epsilon) = 1 + \epsilon + \epsilon^2 + \cdots$, which is at most $1 + 2\epsilon$ if $\epsilon \le 1/2$.

Here are some basic properties of approximation within ratio that the reader should keep in
mind. Suppose that $\hat{a}$ approximates a and $\hat{b}$ approximates b, each within ratio r. Then $\hat{a} + \hat{b}$ also
approximates a + b within ratio r. We call this the sum rule for approximations within ratio. By
induction, it holds for all finite sums of positive summands. There is the following similar product
rule. If $\hat{a}$ approximates a within ratio r and $\hat{b}$ approximates b within ratio $r'$, then $\hat{a}\hat{b}$ approximates
the product ab within ratio $r r'$. A useful corollary is that if $\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_n$ approximate $a_1, a_2, \ldots, a_n$,
respectively, each within ratio $1 + \epsilon/2n$ for $\epsilon < 1$, then the product $\prod_i \hat{a}_i$ approximates the product
$\prod_i a_i$ within ratio $(1 + \epsilon/2n)^n \le (1 + \epsilon)$.
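The sum rule, the product rule, and the corollary are easy to sanity-check numerically. The following is a small sketch with made-up numbers; the helper `approximates` and the test values are illustrative and not taken from the thesis.

```python
def approximates(a_hat, a, r):
    """True if a_hat approximates a within ratio r, i.e. a / r <= a_hat <= a * r."""
    if a_hat <= 0 or a <= 0 or r < 1:
        raise ValueError("need positive quantities and ratio r >= 1")
    return a / r <= a_hat <= a * r


def corollary_holds(eps, n):
    """Check (1 + eps / (2n))**n <= 1 + eps for one choice of eps and n."""
    return (1.0 + eps / (2.0 * n)) ** n <= 1.0 + eps


if __name__ == "__main__":
    a, a_hat = 10.0, 11.0
    b, b_hat = 5.0, 4.5
    r = 1.2
    # Each estimate is within ratio r of its target.
    print(approximates(a_hat, a, r), approximates(b_hat, b, r))
    # Sum rule: the sum is within the same ratio r.
    print(approximates(a_hat + b_hat, a + b, r))
    # Product rule: the product is within ratio r * r.
    print(approximates(a_hat * b_hat, a * b, r * r))
    # Corollary: n factors, each within 1 + eps/2n, stay within 1 + eps overall.
    print(all(corollary_holds(0.99, n) for n in range(1, 51)))
```

The corollary holds because $(1 + \epsilon/2n)^n \le e^{\epsilon/2}$, and $e^{\epsilon/2} \le 1 + \epsilon$ for $0 \le \epsilon \le 1$.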



A.2 Approximate Distributions

To be able to say with precision that two distributions are 'close' to each other, we utilize various
measures of distance between distributions: (total) variation, and separation. Here we state their
definitions and some of their properties.

A.2.1 Total Variation

Variation distance or total variation is a common statistical measure of distance between two
distributions based on the $L^1$ norm.

If P and Q are two probability distributions on a finite set S, the variation distance or total
variation between P and Q, denoted $\|P - Q\|$, is defined by:

\[ \|P - Q\| = \frac{1}{2} \sum_{s \in S} |P(s) - Q(s)|. \]

Some authors define variation without the 1/2 factor.

Proposition A.1 Variation distance satisfies the following properties:

1. $0 \le \|P - Q\| \le 1$;

2. $\|P - Q\|$ is a metric: it is nonnegative; it is zero if and only if P = Q; and it satisfies the
triangle inequality;

3. $\|P - Q\| = \max_{A \subseteq S} |P(A) - Q(A)|$;

4. $\|P - Q\| = P(A) - Q(A)$, where $A = \{s \in S \mid P(s) > Q(s)\}$;

5. $\|P - Q\| = \frac{1}{2} \sup_{f : \|f\|_\infty \le 1} |E_P[f] - E_Q[f]|$, where f is a real-valued function on S.

Proof: (Outline) (1) Easy. (2) This follows from metric properties of the $L^1$ distance. (3) (4) (5)
These follow by separating S into the three classes: $A = \{s \mid P(s) > Q(s)\}$, $B = \{s \mid P(s) = Q(s)\}$,
and $C = \{s \mid P(s) < Q(s)\}$. Show that A simultaneously gives the maximum and equality with the
definition, yielding (3) and (4). To get (5) define a function f that is 1 on elements in A, takes any
values on elements in B, and -1 on elements in C, and show that this yields the stated supremum
and equality with the definition. ■
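Properties 3, 4, and 5 can be confirmed by brute force on a small state space. The sketch below uses exact rational arithmetic so that the equalities hold exactly; the two distributions and all function names are invented for the illustration.

```python
from fractions import Fraction
from itertools import combinations


def total_variation(P, Q):
    """Half the L1 distance between two distributions given as dicts."""
    return sum(abs(P[s] - Q[s]) for s in P) / 2


def max_event_gap(P, Q):
    """Property 3: maximum of |P(A) - Q(A)| over all subsets A (brute force)."""
    states = list(P)
    best = 0
    for size in range(len(states) + 1):
        for A in combinations(states, size):
            gap = abs(sum(P[s] for s in A) - sum(Q[s] for s in A))
            best = max(best, gap)
    return best


def excess_mass(P, Q):
    """Property 4: P(A) - Q(A) on the set A where P exceeds Q."""
    return sum(P[s] - Q[s] for s in P if P[s] > Q[s])


def witness_gap(P, Q):
    """Property 5 with the witness f = +1 where P > Q and -1 elsewhere."""
    f = {s: 1 if P[s] > Q[s] else -1 for s in P}
    return abs(sum(f[s] * (P[s] - Q[s]) for s in P)) / 2


if __name__ == "__main__":
    P = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
    Q = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
    print(total_variation(P, Q), max_event_gap(P, Q),
          excess_mass(P, Q), witness_gap(P, Q))
```

The brute-force maximum over events is exponential in |S| and is only meant as a check on tiny examples.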


A.2.2 Separation and Relative Pointwise Distance

Relative pointwise distance is also a measure of discrepancy between distributions, but is not a
metric. We define the relative pointwise distance from P to Q as

\[ \mathrm{rpd}(P, Q) = \max_{s \in S} \frac{|P(s) - Q(s)|}{Q(s)}, \]

in other words, the maximum relative error at any point. This distance is not symmetric. Generally,
Q is taken to be some fixed distribution, and we talk about the distance of varying distributions P
from the fixed Q. Notice that this distance is only defined when $Q(s) > 0$ for all $s \in S$. In this
work, Q is generally the stationary distribution of an ergodic Markov chain, and such a distribution
is necessarily positive everywhere on the state space.

Closely related to relative pointwise distance is the separation of P from Q, denoted sep(P, Q).
It is defined by:

\[ \mathrm{sep}(P, Q) = \max_{s \in S} \left( 1 - \frac{P(s)}{Q(s)} \right). \]

Note the absence of the absolute value. Separation arises naturally in the study of strong stationary
times [AD86].

Proposition A.2 Separation and relative pointwise distance satisfy the following properties:

1. $\mathrm{rpd}(P, Q) \ge \mathrm{sep}(P, Q) \ge 0$;

2. $\mathrm{rpd}(P, Q) = 0$ (also $\mathrm{sep}(P, Q) = 0$) if and only if P = Q;

3. $\mathrm{sep}(P, Q) \le 1$, with equality if and only if P(s) = 0 for some $s \in S$;

4. $\|P - Q\| \le \frac{1}{2} \mathrm{rpd}(P, Q)$;

5. $\|P - Q\| \le \mathrm{sep}(P, Q)$;

6. if U is the uniform distribution, then $\mathrm{rpd}(P, U) \le |S| - 1$.

Proof: (Outline) (1)(2)(3) Easy. (4) By definition, for all $s \in S$ we have $|P(s) - Q(s)| \le
\mathrm{rpd}(P, Q)\, Q(s)$. Sum both sides of this inequality over $s \in S$ to get: $\sum_{s \in S} |P(s) - Q(s)| \le \mathrm{rpd}(P, Q)$.
Then divide by two to get the desired result. (5) Let C be the set defined in the proof of item 4 of
Proposition A.1. Then we have

\[ \|P - Q\| = Q(C) - P(C) = \sum_{s \in C} \left( 1 - \frac{P(s)}{Q(s)} \right) Q(s) \le \mathrm{sep}(P, Q) \sum_{s \in C} Q(s) \le \mathrm{sep}(P, Q). \]

(6) Again, easy. ■
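The two distances and the inequalities of Proposition A.2 can be illustrated on a three-point example. This is a sketch with invented distributions and function names; exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction


def total_variation(P, Q):
    """Half the L1 distance between two distributions given as dicts."""
    return sum(abs(P[s] - Q[s]) for s in Q) / 2


def rpd(P, Q):
    """Relative pointwise distance from P to Q; requires Q(s) > 0 everywhere."""
    return max(abs(P[s] - Q[s]) / Q[s] for s in Q)


def sep(P, Q):
    """Separation of P from Q; note there is no absolute value."""
    return max(1 - P[s] / Q[s] for s in Q)


if __name__ == "__main__":
    P = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
    Q = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
    print("rpd(P, Q) =", rpd(P, Q), " rpd(Q, P) =", rpd(Q, P))  # not symmetric
    print("sep(P, Q) =", sep(P, Q))
    print("||P - Q|| =", total_variation(P, Q))
```

In this example rpd(P, Q) and rpd(Q, P) differ, which shows concretely why the distance has to be read as "from P to the fixed Q".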


A.2.3 Approximation within Ratio

For two distributions P and Q supported everywhere on S, we say that P approximates Q within
ratio r if for all points $s \in S$, P(s) approximates Q(s) within ratio r pointwise.

Our earlier comments relating relative error and approximation within ratio for single points
carry over here. In particular, if for some positive $\epsilon$, P approximates Q within ratio $(1 + \epsilon)$, then
$\mathrm{rpd}(P, Q) \le \epsilon$. Almost conversely, if $\mathrm{rpd}(P, Q) \le \epsilon/2$, and $\epsilon \le 1$, then P approximates Q within
ratio $1 + \epsilon$.
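Both implications follow in a line or two from the definitions. The derivation below is a reader's sketch rather than text from the thesis; it writes $\epsilon$ for the tolerance and uses only the definitions of rpd and of approximation within ratio, applied at an arbitrary point $s \in S$.

```latex
% Ratio (1 + eps) implies rpd <= eps:
\[
\frac{Q(s)}{1+\epsilon} \le P(s) \le (1+\epsilon)\,Q(s)
\;\Longrightarrow\;
P(s) - Q(s) \le \epsilon\, Q(s),
\qquad
Q(s) - P(s) \le \frac{\epsilon}{1+\epsilon}\, Q(s) \le \epsilon\, Q(s).
\]

% rpd <= eps/2 with eps <= 1 implies ratio (1 + eps):
\[
P(s) \le \Bigl(1 + \tfrac{\epsilon}{2}\Bigr) Q(s) \le (1+\epsilon)\, Q(s),
\qquad
P(s) \ge \Bigl(1 - \tfrac{\epsilon}{2}\Bigr) Q(s) \ge \frac{Q(s)}{1+\epsilon},
\]
\[
\text{since}\quad
\Bigl(1 - \tfrac{\epsilon}{2}\Bigr)(1+\epsilon)
= 1 + \tfrac{\epsilon}{2}\,(1 - \epsilon) \ge 1
\quad\text{for } 0 \le \epsilon \le 1.
\]
```

The factor of two lost in the converse direction is what makes the statement only "almost" a converse.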

The  following  two  properties  are  fundamental. 




Proposition A.3 If distribution P approximates Q within ratio r on a finite set V, then for every
subset $A \subseteq V$, $P(A) = \sum_{v \in A} P(v)$ is within ratio r of Q(A).

Proof: Immediate from the sum rule. ■

Proposition A.4 Let X and Y be drawn from a finite set V with distributions P and Q, respectively.
Suppose that P approximates Q within ratio r. Then for any function b on V, the distribution of
b(X) also approximates the distribution of b(Y) within ratio r.

Proof: For any value z in the range of b, consider the probability $\Pr\{b(X) = z\}$. We wish to show
that this is within ratio r of the probability $\Pr\{b(Y) = z\}$. For this, simply consider the set $V_z$ of
$v \in V$ such that b(v) = z. Now, $\Pr\{b(X) = z\} = \Pr\{X \in V_z\} = P(V_z)$. Applying the preceding
proposition, we find that this is within ratio r of $Q(V_z) = \Pr\{Y \in V_z\} = \Pr\{b(Y) = z\}$. Since this
holds for each z in the range of b, we have proved the proposition. ■


A.2.4 Kolmogorov-Smirnov Distance

In reporting our experiments in Chapter 5 we use a distance measure that we call Kolmogorov-Smirnov
distance. We explain that notion here.

Recall that if X is a real-valued random variable, the function $F(x) = \Pr\{X \le x\}$ is called its
cumulative distribution function. If F(x) and G(x) are the cumulative distribution functions
for two random variables, the Kolmogorov-Smirnov distance (K-S distance) between them is

\[ \sup_{x \in \mathbb{R}} |F(x) - G(x)|. \]

Since the significance level of a given 'observed' value $x_0$ under the 'null hypothesis' that it
was drawn according to the distribution F is $F(x_0)$, the K-S distance gives the maximum absolute
difference between any two significance levels for the same observed value, one measured under F
and the other under G. So, for example, if F is an approximation we construct for G, and the main
purpose of the approximation is to obtain approximations to significance levels under G, the K-S
distance is the natural distance measure to use, and this is why it is applied in Chapter 5.

The quantity we report as K-S distance is actually a discretized version of K-S distance. If f is
a histogram, let F(b) denote the cumulative sum over all bins up to and including bin b. If F and G
are the cumulative sums for two histograms with common bin sizes, the quantity we report as K-S
distance is the maximum value of $|F(b) - G(b)|$ over all bins b.
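The discretized K-S distance is a few lines of code once the two histograms share the same bins. The sketch below assumes, as an addition not stated in the text, that each histogram is given as a list of nonnegative bin counts and is normalized to total mass 1 before the cumulative sums are compared; the function name `ks_distance` is illustrative.

```python
from itertools import accumulate


def ks_distance(hist_f, hist_g):
    """Maximum |F(b) - G(b)| over bins b for two histograms on common bins.

    Each histogram is a list of bin counts; both are normalized to sum to 1
    (an assumption of this sketch) before their cumulative sums are compared.
    """
    if len(hist_f) != len(hist_g):
        raise ValueError("histograms must use the same bins")
    total_f, total_g = sum(hist_f), sum(hist_g)
    if total_f <= 0 or total_g <= 0:
        raise ValueError("histograms must have positive total mass")
    F = [c / total_f for c in accumulate(hist_f)]
    G = [c / total_g for c in accumulate(hist_g)]
    return max(abs(x - y) for x, y in zip(F, G))


if __name__ == "__main__":
    print(ks_distance([1, 2, 1], [2, 1, 1]))
```

Because only cumulative sums enter, the result does not depend on how mass is arranged inside a bin, which is why a common binning is required.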


Appendix  B 

Sampling  from  Near-Stationary 
Chains 


In  Chapter  1  we  showed  a  number  of  techniques  for  obtaining  convergence  guarantees.  In  this 
appendix  we  prove  some  results  showing  how  to  use  convergence  guarantees  to  yield  guarantees 
about the quality of statistics based on samples from the nearly stationary Markov chain.

In  Section  B.l,  we  show  that  well-spaced  nearly  stationary  samples  from  a  Markov  chain  yield 
nearly  independent,  nearly  stationary  samples  and  that  statistics  based  on  these  samples  will  nearly 
match,  in  distribution,  the  statistics  based  on  independent  and  identically  distributed  (i.i.d.)  samples 
drawn  according  to  the  stationary  distribution. 

Recall  that  in  Chapter  3  we  discussed  mean  value  estimation  using  highly  correlated  closely 
spaced  samples  drawn  from  a  Markov  chain.  There  we  assumed  that  the  chain  started  exactly  in  the 
stationary  distribution.  This  was  fine  for  the  applications  described  in  Examples  3.7  and  3.8  and  in 
Chapter  4,  where  it  was  easy  to  get  an  initial  sample  drawn  according  to  the  stationary  distribution. 
However,  often  one  may  not  have  an  efficient  method  of  starting  the  chain  exactly  in  the  stationary 
distribution. In fact, the Markov chain may be the only known means of generating even nearly
uniform samples efficiently. (This is the case, for example, in Example 3.9 and in Chapter 5.) In
these cases one wants first to simulate the chain until one knows, by a convergence guarantee, that
the chain is nearly stationary, and then one wants to begin sampling. In Section B.2, we show how
to extend the error bounds in Corollary 3.5 to this situation of sampling from the near-stationary
chain. We also explain how the Median Lemma (Lemma 3.1) can be applied with such estimates,
provided that between successive estimates we take enough steps to ensure their near-independence.

We work here with the notion of 'approximation within ratio.' Two important combinational
properties of approximation within ratio, the sum and product rules, are discussed in Appendix A.
The reader should be familiar with that material before proceeding.




B.1 Nearly Independent, Near-Stationary Samples

When samples are drawn from a Markov chain, we may not be able to claim that each successive
sample is independent or even exactly uniformly distributed. However, we will typically have
guarantees that regardless of the values of prior samples, each successive sample has, to within a small
tolerance, approximately the stationary distribution. In this case, we can guarantee that they share
important approximation properties with true i.i.d. samples.

Suppose that $Y_1, Y_2, \ldots, Y_n$ are each drawn randomly from V. We will say that they are jointly
i.i.d. Q within ratio $1 + \epsilon$ if the joint distribution of $Y_1, Y_2, \ldots, Y_n$ approximates the product
distribution $Q^n$ of n independent Q-distributed random variables within ratio $1 + \epsilon$. The following
theorem tells us that well-spaced samples drawn from a Markov chain are jointly i.i.d. $\pi$ within
ratio $1 + \epsilon$.

Theorem B.1 Let $\{X_k \mid k \ge 0\}$ be a time-reversible ergodic Markov chain with a convergence
guarantee $T(\epsilon)$.

If $t \ge T(\epsilon/2n)$ and $\{X_{it} \mid 1 \le i \le n\}$ are n samples drawn every t steps from the chain, then
these samples are jointly i.i.d. $\pi$ within ratio $1 + \epsilon$.

Proof: Let

\[ p_i(x; x_1, \ldots, x_{i-1}) = \Pr\{X_{it} = x \mid X_t = x_1, X_{2t} = x_2, \ldots, X_{(i-1)t} = x_{i-1}\}. \]

Then we have

\[ \Pr\{X_t = x_1, X_{2t} = x_2, \ldots, X_{nt} = x_n\} = \prod_{1 \le i \le n} p_i(x_i; x_1, x_2, \ldots, x_{i-1}). \]

Now, by the Markov property, we have

\[ p_i(x; x_1, x_2, \ldots, x_{i-1}) = \Pr\{X_{it} = x \mid X_{(i-1)t} = x_{i-1}\} \]

and by the convergence guarantee of $T(\epsilon/2n)$, this approximates the distribution $\pi$ within ratio
$1 + \epsilon/2n$. By the product rule for approximations within ratio, the product $\prod_i p_i(x_i; x_1, x_2, \ldots, x_{i-1})$
approximates the product of the $\pi(x_i)$ within ratio $1 + \epsilon$. (See Appendix A.) ■
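The argument of Theorem B.1 can be traced on a toy chain. The sketch below is an illustration under stated assumptions: it uses a made-up two-state chain, and it reads the convergence guarantee as "after t steps from any start state, every entry of the t-step matrix is within ratio $1 + \epsilon/2n$ of $\pi$". It finds such a t by direct matrix powers and then checks the joint distribution of n spaced samples against the product distribution by enumeration. All names are hypothetical.

```python
import itertools

# A made-up two-state chain with stationary distribution (2/3, 1/3).
P = [[0.9, 0.1], [0.2, 0.8]]
PI = [2.0 / 3.0, 1.0 / 3.0]


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def ratio(p, q):
    """Smallest r >= 1 such that p approximates q within ratio r."""
    return max(p / q, q / p)


def worst_step_ratio(Pt, pi):
    """Worst ratio between a t-step transition probability and pi."""
    k = len(pi)
    return max(ratio(Pt[x][y], pi[y]) for x in range(k) for y in range(k))


def spacing_for(P, pi, target):
    """Smallest t whose t-step matrix is within ratio `target` of pi."""
    t, Pt = 1, P
    while worst_step_ratio(Pt, pi) > target:
        Pt = mat_mul(Pt, P)
        t += 1
    return t, Pt


def worst_joint_ratio(Pt, pi, n, start):
    """Worst ratio between the joint law of n spaced samples and pi^n."""
    worst = 1.0
    for xs in itertools.product(range(len(pi)), repeat=n):
        joint, ideal, prev = 1.0, 1.0, start
        for x in xs:
            joint *= Pt[prev][x]
            ideal *= pi[x]
            prev = x
        worst = max(worst, ratio(joint, ideal))
    return worst


EPS, N_SAMPLES = 0.5, 4
T_SPACING, PT = spacing_for(P, PI, 1.0 + EPS / (2 * N_SAMPLES))

if __name__ == "__main__":
    print("spacing t =", T_SPACING)
    print("worst joint ratio:", worst_joint_ratio(PT, PI, N_SAMPLES, start=0))
```

The enumeration is exponential in n and is meant only to make the product-rule step visible on a small case.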

The next theorem tells us that statistics based on such nearly i.i.d. samples are close, in
distribution, to the corresponding statistics based on true i.i.d. samples.

Theorem B.2 Let b be any function on $V^n$. If $\{X_i \mid 1 \le i \le n\}$ are jointly i.i.d. $\pi$ within ratio $1 + \epsilon$,
then the distribution of $b(X_1, X_2, \ldots, X_n)$ is within ratio $1 + \epsilon$ of the distribution of $b(Y_1, Y_2, \ldots, Y_n)$,
where $Y_1, Y_2, \ldots, Y_n$ are i.i.d. samples from V with common distribution $\pi$.

Proof: Since the $X_i$ are jointly i.i.d. $\pi$ within ratio $(1 + \epsilon)$, by definition the distribution of
$(X_1, X_2, \ldots, X_n)$ on $V^n$ is within ratio $(1 + \epsilon)$ of the distribution of $(Y_1, Y_2, \ldots, Y_n)$. Apply
Proposition A.4 from Appendix A. ■




B.2 Other Near-Stationary Samples

In  Chapter  3  we  introduced  the  mean  value  estimator  C* ,  which  was  the  sample  mean  based  on  n 
samples  drawn  some  t  steps  apart  from  a  Markov  chain  that  was  evolving  exactly  in  the  stationary 
distribution.  Here  we  show  how  the  error  bounds  we  proved  there  hold  approximately  when  we  draw 
the  samples  from  a  nearly  stationary  chain.  We  assume  the  reader  has  already  studied  the  material 
in  Chapter  3. 

Let $P$ be an ergodic Markov chain with stationary distribution $\pi$ on the finite state space $V$.
Take any $t \ge 1$, and consider samples $X_0, X_t, X_{2t}, \ldots, X_{(n-1)t}$ drawn $t$ steps apart from the chain,
where $\pi_0$ is the distribution of the first sample $X_0$. These $n$ samples may be viewed collectively as
a sample drawn from $V^n$ according to the joint probability

$$\Pr\{(X_0, X_t, X_{2t}, \ldots, X_{(n-1)t}) = (x_0, x_1, x_2, \ldots, x_{n-1})\} = D(\pi_0; x_0, x_1, x_2, \ldots, x_{n-1}),$$

where

$$D(\pi_0; x_0, x_1, x_2, \ldots, x_{n-1}) = \pi_0(x_0)\, P^{(t)}_{x_0 x_1} P^{(t)}_{x_1 x_2} \cdots P^{(t)}_{x_{n-2} x_{n-1}}.$$

Now, if the distribution $\pi_0$ approximates $\pi$ to within ratio $r$, then the value

$$D(\pi_0; x_0, x_1, x_2, \ldots, x_{n-1})$$

will approximate

$$D(\pi; x_0, x_1, x_2, \ldots, x_{n-1})$$

within ratio $r$. (Note: $D(\pi; x_0, x_1, x_2, \ldots, x_{n-1})$ is the distribution of $n$ samples drawn $t$ steps apart
from the chain that is started exactly in the stationary distribution $\pi$.) By the sum rule, this extends
to subsets; if $S \subseteq V^n$ is any subset of states, then to within ratio $r$, the probability

$$\Pr\{(X_0, X_t, X_{2t}, \ldots, X_{(n-1)t}) \in S\} = \sum_{(x_0, x_1, \ldots, x_{n-1}) \in S} D(\pi_0; x_0, x_1, x_2, \ldots, x_{n-1})$$

that our sequence of samples lies in $S$ approximates the probability

$$\sum_{(x_0, x_1, \ldots, x_{n-1}) \in S} D(\pi; x_0, x_1, x_2, \ldots, x_{n-1})$$

of the same event had we started the chain in the stationary distribution $\pi$.

From this, the next theorem follows immediately.

Theorem B.3 Let $P$ be an ergodic Markov chain with stationary distribution $\pi$ on $V$. Let $X_0$ be
drawn according to $\pi_0$, and let $X_0, X_t, X_{2t}, \ldots, X_{(n-1)t}$ be $n$ samples drawn $t$ steps apart from the
chain starting at $X_0$. Similarly, let $Y_0$ be drawn according to $\pi$, and let $Y_0, Y_t, Y_{2t}, \ldots, Y_{(n-1)t}$ be $n$
samples drawn $t$ steps apart from the chain starting at $Y_0$. If $\pi_0$ approximates $\pi$ within ratio $r$, then
the distribution of $(X_0, X_t, \ldots, X_{(n-1)t})$ on $V^n$ approximates the distribution of $(Y_0, Y_t, \ldots, Y_{(n-1)t})$ to
within the same ratio $r$.
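
The ratio argument behind Theorem B.3 can be checked numerically on a toy chain. The sketch below is illustrative only: the 3-state symmetric matrix `P`, the start distribution `pi0`, and the helper names are all hypothetical, and "within ratio $r$" is read here as $q/r \le p \le rq$ pointwise, which is an assumption about the definition given in Appendix A.

```python
import itertools

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, t):
    """t-step transition matrix P^t for an integer t >= 1."""
    result = P
    for _ in range(t - 1):
        result = mat_mul(result, P)
    return result

def joint_prob(start, Pt, xs):
    """D(start; x_0, ..., x_{n-1}) = start(x_0) * prod_k Pt[x_k][x_{k+1}]."""
    p = start[xs[0]]
    for a, b in zip(xs, xs[1:]):
        p *= Pt[a][b]
    return p

def worst_ratio(p, q):
    """Smallest r >= 1 with q_i / r <= p_i <= r * q_i for all i (assumed reading)."""
    return max(max(a / b, b / a) for a, b in zip(p, q))

# Hypothetical symmetric chain, so its stationary distribution is uniform.
P = [[0.5, 0.3, 0.2],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
pi = [1 / 3, 1 / 3, 1 / 3]
pi0 = [0.36, 0.33, 0.31]          # a start distribution close to pi
t, n = 2, 3
Pt = mat_pow(P, t)

# Joint distributions of n samples drawn t steps apart, for both starts.
seqs = list(itertools.product(range(3), repeat=n))
d_near = [joint_prob(pi0, Pt, s) for s in seqs]
d_stat = [joint_prob(pi, Pt, s) for s in seqs]

start_ratio = worst_ratio(pi0, pi)
joint_ratio = worst_ratio(d_near, d_stat)
```

Because the transition factors cancel, `joint_ratio` should not exceed `start_ratio` (up to rounding), which is the content of the theorem for singletons; summing over a subset preserves the same ratio.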




Theorem B.2 can then be used to conclude that sampling from the near-stationary chain
gives statistics close (in distribution) to sampling from the stationary chain. For example, we can
extend Corollary 3.5 as follows.

Corollary B.4 (Chebysheff Bounds in the Near-Stationary Case) Adopt the setup of Theorem 3.S
and its corollaries for the mean-value estimator $C_n^t$ (with $t$ odd) on a time-reversible ergodic
Markov chain. However, suppose that $X_0$ is drawn, not from the stationary distribution $\pi$, but
instead according to a distribution $\pi_0$ that approximates $\pi$ within ratio $(1 + \epsilon)$ ($\epsilon \ge 0$). Then for
$n \ge (2\tau(t) - 1)h$, and $t$ odd, we can guarantee

$$\Pr\{|C_n^t - h_\pi| > \beta\} \le \frac{(2\tau(t) - 1)h^2}{n\beta^2}\,(1 + \epsilon).$$

In particular, if $n \ge (8\tau(t) - 4)h^2(1 + \epsilon)/\beta^2$, then

$$\Pr\{|C_n^t - h_\pi| > \beta\} \le \frac{1}{4}.$$

Proof: View $C_n^t$ as a function on the finite set $V^n$. We apply the same reasoning as in
Proposition A.4 together with the preceding theorem. (In fact, combining these immediately gives the
theorem, but we explain how the proof of the Proposition works in this particular setting.)

Let $E$ be the event that $|C_n^t - h_\pi| > \beta$. This event occurs if $(X_0, X_t, X_{2t}, \ldots, X_{(n-1)t})$ falls
in a certain subset $S_E \subseteq V^n$ of possible sequences of samples. By the previous theorem and
Proposition A.3, the probability that $(X_0, X_t, X_{2t}, \ldots, X_{(n-1)t})$ falls in the subset $S_E$ is within ratio
$(1 + \epsilon)$ of what it would have been were $X_0$ drawn exactly from the stationary distribution. We gave an
upper bound on the probability of the event $E$ in this stationary case in Corollary 3.5. We get the
new bound by multiplying that probability by $(1 + \epsilon)$. This proves the theorem. □

Remark: Note that, when sampling from the near-stationary chain, the estimator $C_n^t$ is no longer
unbiased. We are not applying a Chebysheff bound to $C_n^t$ in the near-stationary case. Instead,
having proved a bound on the probability of the event $E$ in the stationary case, we can apply the
preceding Theorem to bound the probability of the same event in the near-stationary case. This
approach seems altogether much simpler than trying to reason out a similar bound directly using
error bounds for the mean and variance of $C_n^t$ in the near-stationary case. □


B.3  On  Near-Independence  and  the  Median  Lemma 

In  Chapter  3,  we  discussed  the  Median  Lemma  (Lemma  3.1,  page  54)  as  an  efficient  way  to  use 
repetition  to  reduce  the  probable  error  of  an  estimate.  That  lemma  relies  on  certain  independence 
properties,  and  it  may  not  immediately  be  clear  that  this  lemma  can  be  applied  with  the  estimator 
$C_n^t$, which uses correlated samples.




It is important to understand the type of independence required. Notice, in particular, that the
proof does not require that the samples on which the $m$ estimates $a_i$ are based be independent;
we only use the fact that the events

$$E_i = \{|a_i - a| > \beta\}$$

occur with probability bounded by 1/4 independent of the previous events $E_j$. In other words, we
require that for $1 \le i \le m$ we have

$$\Pr\{E_i \mid E_1, E_2, \ldots, E_{i-1}\} \le 1/4. \tag{B.1}$$

The law of conditional probability then gives Lemma 3.1.

So the use of correlated samples in the estimator $C_n^t$ does not immediately preclude the
application of the Median Lemma with repeated trials of that estimator. We only need to guarantee
that this independence condition holds for the separate trials. The following algorithm does this,
summarising the method suggested by the discussion here and in the previous section.

Algorithm B.5 (Estimation from the Near-Stationary Chain) Let $T(\epsilon)$ be a convergence
guarantee for the reversible ergodic Markov chain $P$. Adopting the remainder of the setup of
Chapter 3, consider the following procedure. Let $\epsilon$ be any nonnegative real number.

Start the chain in any initial state.

for $1 \le i \le m$ do begin

  Run the chain $P$ for $T(\epsilon)$ steps.

  Starting with this sample $X_0$, compute $C_n^t$ based on $n = \lceil (8\tau(t) - 4)h^2(1+\epsilon)/\beta^2 \rceil$ samples
  drawn $t$ steps apart, and call the resulting estimate $a_i$.

end

Compute the median $M$ of the $m$ values $a_i$, $1 \le i \le m$. Output $M$.
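
The control flow of Algorithm B.5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the burn-in length (standing in for $T(\epsilon)$) and the sample count $n$ are passed in by the caller rather than derived from $\tau(t)$, $h$, $\beta$ and $\epsilon$, and the function and parameter names are hypothetical.

```python
import random
import statistics

def step(P, x, rng):
    """One transition of the chain with row-stochastic matrix P from state x."""
    return rng.choices(range(len(P)), weights=P[x])[0]

def near_stationary_median_estimate(P, h, burn_in, t, n, m, rng, x=0):
    """Median of m sample means, each computed after a fresh burn-in.

    burn_in plays the role of T(eps); n and t are the number of samples per
    trial and their spacing. All three are assumed to be supplied by the caller.
    """
    estimates = []
    for _ in range(m):
        for _ in range(burn_in):        # run the chain for T(eps) steps
            x = step(P, x, rng)
        total = h(x)                    # this state is the sample X_0
        for _ in range(n - 1):          # remaining samples, t steps apart
            for _ in range(t):
                x = step(P, x, rng)
            total += h(x)
        estimates.append(total / n)     # the estimate a_i for this trial
    return statistics.median(estimates)

# Hypothetical usage: a symmetric 3-state chain (uniform stationary
# distribution) and h(x) = x, whose stationary mean is 1.
P = [[0.5, 0.3, 0.2],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
M = near_stationary_median_estimate(P, lambda s: float(s), burn_in=20,
                                    t=1, n=400, m=5, rng=random.Random(1))
```

Note that the chain is not restarted between trials; each trial simply continues from wherever the previous one stopped and burns in again, as in the algorithm.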

Theorem B.6 The estimate $M$ produced by Algorithm B.5 satisfies

$$\Pr\{|M - h_\pi| > \beta\} \le 2^{-m}.$$

Proof: (Outline) Corollary B.4 provides a bound of 1/4 on the probability of each event $E_i =
\{|a_i - a| > \beta\}$ (with $a = h_\pi$). Combining the Markov property (1.1) with the definition of a convergence guarantee,
it is straightforward to arrive at the independence condition (B.1) between the events $E_i$. □


Appendix  C 

Enumerating  Contingency  Tables 


This appendix provides a brief review of some methods for counting $\Sigma_{rc}$ and some interesting
combinatorial interpretations of $\Sigma_{rc}$. We discuss various formulas for counting the set using groups,
symmetric functions, recurrences, and generating functions. We prove the #P-completeness of a
similar set, suggesting that the problem of counting $\Sigma_{rc}$ is hard. (The complexity of counting $\Sigma_{rc}$,
however, remains unknown.) In the last section, we outline how the techniques of Chapter 5 can be
extended to apply to approximately counting $\Sigma_{rc}$.

Recall from Chapter 5 our usual interpretations of the notations $r$, $c$, $N$, $\Sigma_{rc}$, and $n$. In
addition, we use the notation $M_{rc}$ in this appendix to denote the cardinality $|\Sigma_{rc}|$.

C.1 Classical Counting Approaches

C.1.1  Exhaustive  Enumeration 

Though $\Sigma_{rc}$ is in general very large, one can sometimes afford to enumerate its elements exhaustively.
This, of course, affords a means of exactly calculating the cardinality $M_{rc}$ as well as various other
functionals on the set, such as mean values of functions and significance levels under the uniform
distribution. A number of authors [Mar72] [BW73] [Han74] have suggested exhaustive algorithms.
These work by stepping through the set $\Sigma_{rc}$, one table at a time. They begin at some canonically
constructed initial table, and proceed by making small changes to the cell entries, so that the tables
increase monotonically in some linear ordering. We use such an algorithm to compute exact results
for some of our experimental comparisons.
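
As an illustration of exhaustive enumeration, the following sketch steps through all nonnegative integer tables with given margins in row-major lexicographic order. It is a plain backtracking generator written for this note, not one of the cited algorithms; the function name is hypothetical, and it assumes at least one row and one column.

```python
def enumerate_tables(r, c):
    """Yield every nonnegative integer table with row sums r and column sums c.

    Tables are produced in increasing row-major lexicographic order.
    """
    m, n = len(r), len(c)
    if sum(r) != sum(c):
        return
    table = [[0] * n for _ in range(m)]
    col_left = list(c)                  # column capacity still unused

    def fill(i, j, row_left):
        if i == m:                      # all rows placed
            if all(v == 0 for v in col_left):
                yield tuple(tuple(row) for row in table)
            return
        if j == n - 1:
            # The last cell of a row is forced by the row sum.
            choices = [row_left] if row_left <= col_left[j] else []
        else:
            choices = range(min(row_left, col_left[j]) + 1)
        for v in choices:
            table[i][j] = v
            col_left[j] -= v
            if j == n - 1:
                yield from fill(i + 1, 0, r[i + 1] if i + 1 < m else 0)
            else:
                yield from fill(i, j + 1, row_left - v)
            col_left[j] += v
        table[i][j] = 0

    yield from fill(0, 0, r[0])

tables = list(enumerate_tables((2, 1), (1, 1, 1)))
```

The running time is proportional to the number of partial tables explored, so this is only practical for small margins.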

For calculating the exact significance of a table under the hypergeometric model, Pagano and
Taylor-Halvorsen [PTH81] have proposed a shortcut that typically avoids generating every table,
but in the worst case may still be exhaustive. They make use of the structure of the hypergeometric
distribution, and their trick does not apply under the uniform model. (The hypergeometric or
Fisher-Yates model can be viewed as based on the following sampling scheme. Imagine a population
of exactly $N$ members, where for each $j$, $1 \le j \le n$, there are $c_j$ elements of type $j$. For each $i$,
$1 \le i \le m$, fill row $i$ by sampling $r_i$ elements uniformly and independently without replacement from among the
population, and if an element of type $j$ is picked, place it in column $j$ of row $i$. This is different from
the multinomial model, in which we sample with replacement from a population in which a fraction
$c_j/N$ falls in category $j$.)

C.1.2 A Recursive Formula

Gail and Mantel [GM77] give a straightforward recursive formulation of $M_{rc}$. For example, for the
$m \times 3$ case, they give

$$M(r_1, r_2, \ldots, r_m; c_1, c_2) = \sum_{k_1, k_2} M(r_1, r_2, \ldots, r_{m-1}; c_1 - k_1, c_2 - k_2),$$

where $M(r_1, r_2, \ldots, r_m; c_1, c_2)$ denotes the number of tables with the prescribed row sums $r_i$, $1 \le
i \le m$, and the prescribed column sums $c_j$, $1 \le j \le n$; the third column sum is implicit, since it is
determined by $c_3 = N - (c_1 + c_2)$. The sum on the right runs over the values $k_1$ and $k_2$ with

$$0 \le k_i \le \min(r_m, c_i) \quad \text{and} \quad k_1 + k_2 \le r_m.$$

While this might seem at first to suggest a quick way to compute $M_{rc}$ in general, the
generalisation of this recurrence to arbitrary dimension and margins takes exponential time to compute.
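
A recurrence of this kind, generalised to any number of columns, can be sketched with memoisation: peel off the last row, distribute its sum over the columns in every feasible way, and recurse on the reduced margins. This is an illustrative reading of the idea, not Gail and Mantel's own formulation; the helper names are hypothetical, and all columns are tracked explicitly rather than leaving the last one implicit.

```python
from functools import lru_cache

def count_tables(r, c):
    """Number of nonnegative integer tables with row sums r and column sums c."""
    r, c = tuple(r), tuple(c)
    if sum(r) != sum(c):
        return 0

    def splits(total, caps):
        """Yield all tuples k with 0 <= k[j] <= caps[j] and sum(k) == total."""
        if not caps:
            if total == 0:
                yield ()
            return
        for v in range(min(total, caps[0]) + 1):
            for rest in splits(total - v, caps[1:]):
                yield (v,) + rest

    @lru_cache(maxsize=None)
    def M(rows, cols):
        if not rows:
            return 1 if all(v == 0 for v in cols) else 0
        last = rows[-1]
        # Sum over all ways k of filling the last row, then recurse.
        return sum(M(rows[:-1], tuple(cj - kj for cj, kj in zip(cols, k)))
                   for k in splits(last, cols))

    return M(r, c)
```

Memoisation helps when many sub-problems share the same reduced margins, but the number of distinct states can still grow exponentially with the dimensions, consistent with the remark above.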


C.1.3  Approximation  Formulas 

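Diaconis and Efron [DE85] provide the following approximation for $M_{rc}$ using a volume-times-density
argument. They do not provide an explicit means of estimating the error. Let

$$w = \frac{1}{1 + mn/2N},$$

$$\bar r_i = \frac{1 - w}{m} + \frac{w\,r_i}{N} \quad \text{for } 1 \le i \le m,$$

$$\bar c_j = \frac{1 - w}{n} + \frac{w\,c_j}{N} \quad \text{for } 1 \le j \le n,$$

and

$$k = \frac{n + 1}{n \sum_{i=1}^{m} \bar r_i^{\,2}} - \frac{1}{n}.$$

Then

$$M_{rc} \approx \left(\frac{2N + mn}{2}\right)^{(m-1)(n-1)} \left(\prod_{i=1}^{m} \bar r_i\right)^{n-1} \left(\prod_{j=1}^{n} \bar c_j\right)^{k-1} \frac{\Gamma(nk)}{\Gamma(n)^m\,\Gamma(k)^n}. \tag{C.1}$$

They carefully motivate the approximation and give some empirical evidence that it is accurate. We
use (C.1) in some comparisons in Chapter 5.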
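
For experimentation, the approximation can be evaluated in log space to avoid overflow in the Gamma functions. The sketch below assumes the formula reads $w = 1/(1 + mn/2N)$, $\bar r_i = (1-w)/m + w r_i/N$, $\bar c_j = (1-w)/n + w c_j/N$, $k = (n+1)/(n \sum_i \bar r_i^2) - 1/n$, and $M_{rc} \approx ((2N+mn)/2)^{(m-1)(n-1)} (\prod_i \bar r_i)^{n-1} (\prod_j \bar c_j)^{k-1}\, \Gamma(nk)/(\Gamma(n)^m \Gamma(k)^n)$; this reading is a reconstruction and should be checked against [DE85] before being relied on. The function name is hypothetical.

```python
import math

def diaconis_efron_estimate(r, c):
    """Evaluate the (assumed) Diaconis-Efron approximation to M_rc."""
    m, n, N = len(r), len(c), sum(r)
    w = 1.0 / (1.0 + m * n / (2.0 * N))
    r_bar = [(1 - w) / m + w * ri / N for ri in r]
    c_bar = [(1 - w) / n + w * cj / N for cj in c]
    k = (n + 1) / (n * sum(x * x for x in r_bar)) - 1.0 / n
    # Work with logarithms; lgamma is the log of the Gamma function.
    log_est = ((m - 1) * (n - 1) * math.log((2 * N + m * n) / 2.0)
               + (n - 1) * sum(math.log(x) for x in r_bar)
               + (k - 1) * sum(math.log(x) for x in c_bar)
               + math.lgamma(n * k)
               - m * math.lgamma(n)
               - n * math.lgamma(k))
    return math.exp(log_est)

estimate = diaconis_efron_estimate((2, 1), (1, 1, 1))  # exact count is 3
```

On very small margins such as this one the estimate is only rough; the approximation is intended for larger tables.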




Good [Goo76] has suggested a very similar approximation, with slightly less ‘tuning’ to the
parameters. Gail and Mantel [GM77] suggest a normal approximation.

Remark: The asymmetry between rows and columns in the approximation is an artifact of a
procedural choice in their derivation, namely to use a certain distribution on the column margins
conditioning on the row margins rather than the other way around. □


C.1.4  Tables  and  Group  Theory 

The set $\Sigma_{rc}$ and its cardinality $M_{rc}$ arise in a variety of contexts connected with the symmetric group
and its representations. We describe some examples below to show more reasons to be interested in
this set, and also to suggest alternative algorithms to compute the size.


Double  Cosets 

Given a partition $r$ of $N$, let $Y_r$ be the subgroup of the symmetric group $S_N$ consisting of all
permutations that permute the first $r_1$ elements among themselves, the next $r_2$ elements among
themselves, and so on; $Y_r$ is called a Young subgroup. It is isomorphic to the direct product
of the symmetric groups $S_{r_i}$. If $Y_r$ and $Y_c$ are two Young subgroups of $S_N$, one can define an
equivalence relation $\sim$ on $S_N$ by

$$\pi \sim \sigma \text{ if and only if } \rho\pi\kappa = \sigma \text{ for some } \rho \in Y_r,\ \kappa \in Y_c.$$

The equivalence classes of this relation are called the double cosets of $Y_r$ and $Y_c$ in $S_N$. It is a
classical fact that there is a 1-1 correspondence between the double cosets and the elements of $\Sigma_{rc}$.
For a group-theoretic proof, the reader is referred to James and Kerber [JK81, Corollary 1.3.11].
The correspondence has the following algorithmic combinatorial interpretation. Consider $N$ balls
labelled 1 up to $N$. Color the first $r_1$ balls color 1, the next $r_2$ balls color 2, and so on. Let $\pi$ be
a permutation in $S_N$, viewed as an arrangement of the balls by their numeric labels. Construct
a table $T(\pi)$ as follows: Look at the first $c_1$ places in $\pi$ and for each color $i$, count how many balls
of color $i$ occur in these places. Call these numbers $T(\pi)_{i,1}$, and write them in the corresponding
entries along column 1 of the table $T(\pi)$. Then look at the next $c_2$ places in $\pi$ and count how many
balls of each color $i$ occur, calling these numbers $T(\pi)_{i,2}$, writing them along column 2 of $T(\pi)$.
Continuing in this way produces an $m \times n$ table, $T(\pi)$, whose $i$th row sum is $r_i$ and $j$th column sum
is $c_j$. Thus $T(\pi) \in \Sigma_{rc}$. It is not hard to check that every table in $\Sigma_{rc}$ is $T(\pi)$ for some $\pi$, and
that $T(\pi) = T(\sigma)$ if and only if $\pi$ and $\sigma$ are in the same double coset. Note that if we let $\pi$ be the
identity (or any fixed permutation), this algorithm also gives a way to construct an element of $\Sigma_{rc}$.
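
The ball-coloring construction of $T(\pi)$ translates directly into code. In the sketch below the permutation is given as a list of ball labels in arrangement order; the function name and the 0-based internal indexing of colors and columns are choices made for this illustration.

```python
from itertools import permutations

def table_from_permutation(perm, r, c):
    """Build the m x n table T(perm) from an arrangement of balls 1..N.

    Ball labels 1..r[0] get color 0, the next r[1] labels color 1, and so on.
    Column j counts, per color, the balls in the j-th block of c[j] places.
    """
    N = sum(r)
    assert sorted(perm) == list(range(1, N + 1)) and sum(c) == N
    color = {}
    label = 1
    for i, ri in enumerate(r):
        for _ in range(ri):
            color[label] = i
            label += 1
    table = [[0] * len(c) for _ in r]
    pos = 0
    for j, cj in enumerate(c):
        for ball in perm[pos:pos + cj]:
            table[color[ball]][j] += 1
        pos += cj
    return table

# The identity arrangement gives one element of the set of tables.
T_id = table_from_permutation([1, 2, 3], (2, 1), (1, 1, 1))

# Distinct tables reached over all of S_3 (one per double coset).
distinct = {tuple(map(tuple, table_from_permutation(list(p), (2, 1), (1, 1, 1))))
            for p in permutations([1, 2, 3])}
```

For $r = (2,1)$ and $c = (1,1,1)$ the six permutations collapse onto three tables, matching the count of double cosets.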




Induced  Representations,  Tensor  Products 

Let $G$ be a finite group, $H \subseteq G$ a subgroup, $X = G/H$ the associated coset space, and $L(X)$
the vector space of all real-valued functions on $X$. The group $G$ acts on $L(X)$ by left translation:
$[sf](x) = f(s^{-1}x)$. This action gives the representation of $G$ induced by the trivial representation
of $H$ (see [Ser77]), denoted $\mathrm{Ind}_H^G(\mathrm{triv})$. For $G = S_N$ with Young subgroup $H = Y_r$, these
representations arise directly in the statistical analysis of “partially ranked data of shape $r$” [Dia88] [Dia89].

The representations $\mathrm{Ind}_{Y_r}^{S_N}(\mathrm{triv})$ and $\mathrm{Ind}_{Y_c}^{S_N}(\mathrm{triv})$ decompose into irreducible representations. The
dimension of the space of common irreducible components is called the intertwining number $J(r,c)$.
Formally, this is the dimension of the space of linear maps from $\mathrm{Ind}_{Y_r}^{S_N}(\mathrm{triv})$ to $\mathrm{Ind}_{Y_c}^{S_N}(\mathrm{triv})$ which
commute with the action of the group. A classical theorem of Mackey (see [JK81, p. 17]) states that

$$J(r, c) = M_{rc}.$$

Remark: Here is a related fact. If, instead, we consider $\mathrm{Ind}_{Y_r}^{S_N}(\mathrm{triv})$ and $\mathrm{Ind}_{Y_c}^{S_N}(\mathrm{alt})$, where ‘alt’
denotes the alternating representation, the resulting intertwining number is equal to the number of
0-1 matrices with row sums $r$ and column sums $c$. If $r$ is a partition of $N$, and $c = r^T$, the partition
obtained by transposing the Ferrers diagram of $r$, the resulting intertwining number is 1. This unique
common constituent, $S^c$, is an irreducible representation of $S_N$. As $c$ varies over all partitions, each
irreducible representation occurs as $S^c$ once and only once. James and Kerber [JK81, pp. 34-36]
use this as their basic construction of the irreducible representations of the symmetric group. □

If $(\rho_1, V_1)$, $(\rho_2, V_2)$ are linear representations of $G$, one can form the tensor product of the two
representations. It is the set of matrices $\rho_1(s) \otimes \rho_2(s)$, for $s \in G$, and gives a representation of $G$
on $V_1 \otimes V_2$. This is a basic construction of representation theory, and the need arises to understand
how the tensor product decomposes.

If $I_r$ denotes $\mathrm{Ind}_{Y_r}^{S_N}(\mathrm{triv})$, then

$$I_r \otimes I_c = \bigoplus_{T} I_T,$$

where the direct sum ranges over all tables $T \in \Sigma_{rc}$, where the entries of the table are linearly
ordered to form a partition of $N$. This is proved and discussed by James and Kerber [JK81, pp.
95-98]. The connection between tensor products and double cosets is a special case of a more general
phenomenon discussed by Curtis and Reiner [CR81].

Symmetric  Functions 

A polynomial is called symmetric if it is invariant under every permutation of its variables. Let $\Lambda_n$
be the ring of symmetric polynomials in $n$ variables, with integer coefficients. This
ring decomposes into a direct sum of subrings

$$\Lambda_n = \bigoplus_{k \ge 0} \Lambda_n^k,$$

where $\Lambda_n^k$ consists of the homogeneous symmetric polynomials of degree $k$, together with the zero
polynomial. The reader should refer to Macdonald [Mac79] for further information and background
on material appearing throughout this section.

There are a number of well-known bases for the symmetric functions. The four most common
ones are: the monomial symmetric functions $m_\lambda$, the elementary symmetric functions $e_\lambda$,
the complete symmetric functions $h_\lambda$, and the power sum symmetric functions $p_\lambda$.
(Elements of these bases are conveniently indexed by partitions $\lambda$ of $n$.)

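A scalar product can be defined on $\Lambda_n$ by requiring that the $h$ and $m$ bases should be dual, i.e.

$$\langle h_\lambda \mid m_\mu \rangle = \delta_{\lambda,\mu},$$

where $\delta$ is the Kronecker delta. Moreover, the power sum basis is orthogonal with respect to this
inner product. We have

$$\langle p_\lambda \mid p_\mu \rangle = \delta_{\lambda,\mu}\, z_\lambda, \tag{C.2}$$

where if $a_i$ is the multiplicity of $i$ in the partition $\lambda$, then $z_\lambda = \prod_i i^{a_i}\, a_i!$.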

Macdonald [Mac79, Sec. I.6] shows that if $r$ and $c$ are partitions of $N$, then

$$\langle h_r \mid h_c \rangle = M_{rc}.$$

This suggests an algorithm for computing $M_{rc}$. Given $r$ and $c$, express $h_r$ and $h_c$ in the power-sum
basis. Multiply in this basis using the orthogonality relation (C.2). The result is $M_{rc}$. It can be
shown that this yields an algorithm.

Example C.1 (Magic Squares) Let $H_n(r)$ denote the number of $n \times n$ matrices each of whose row
and column sums is (the same value) $r$. Such matrices are called magic squares, and were first
analysed by MacMahon [Mac16]. This is a special case of determining $M_{rc}$. Stanley [Sta73] [Sta86,
Prop. 4.6.19, p. 232] has shown that $H_n(r)$ is a polynomial in $r$ of degree exactly $(n-1)^2$. This
polynomial is known for $n \le 6$. Being a polynomial of degree $(n-1)^2$, $H_n(r)$ is determined by its
values on $(n-1)^2 + 1$ points. Stanley showed that

$$H_n(-1) = H_n(-2) = \cdots = H_n(-n+1) = 0,$$

and

$$H_n(-n-r) = (-1)^{n-1} H_n(r).$$

So, for example, if we calculate $H_n(i)$ for $1 \le i \le \binom{n-1}{2}$, we can determine the polynomial $H_n(r)$.

The coefficients of $H_5(r)$ and $H_6(r)$ were found in this manner by Jackson and Van Rees [JR75] using
symmetric function techniques. It may now be feasible to find $H_7(r)$. The high order coefficient
of $H_n(r)$ is also interesting for another reason; it is the volume of the polytope of $n \times n$ doubly
stochastic matrices [Sta86]. □




Kostka  Numbers,  RSSK  Correspondence 

There are other natural bases of the symmetric functions in which to try working. For example,
the Schur functions $s_\lambda$ are characterized by being orthonormal in the inner product defined in the
previous section. This means

$$\langle s_\lambda \mid s_\mu \rangle = \delta_{\lambda,\mu}.$$

The coefficients giving the change of basis from the Schur functions to the complete symmetric
functions $h_\mu$ are called Kostka numbers. In other words, the Kostka numbers $K_{\lambda,\mu}$ are the unique
numbers satisfying the relation

$$h_\mu = \sum_{\lambda} K_{\lambda,\mu}\, s_\lambda.$$

Macdonald [Mac79, p. 57] gives a straightforward proof from these definitions that

$$M_{rc} = \sum_{\mu} K_{\mu,r} K_{\mu,c}.$$

D. E. Knuth extended ideas of Robinson, Schensted, and Schützenberger to obtain an elegant
combinatorial interpretation of this identity in terms of an explicit constructive correspondence based
on Young tableaux. For $\lambda$ and $\mu$ partitions of $n$, a semi-standard Young tableau of shape $\lambda$
and content $\mu$ is a Young tableau of shape $\lambda$ containing $\mu_1$ 1's, $\mu_2$ 2's, and in general $\mu_i$ $i$'s. The
contents must be arranged in the cells of the tableau in nondecreasing order (left to right) along
the rows and strictly increasing order (down) each column. The Kostka number $K_{\lambda,\mu}$ is equal to the
number of such tableaux. The RSSK correspondence gives a constructive mapping between tables
with row sums $r$ and column sums $c$ and pairs of semi-standard Young tableaux of contents $r$ and
$c$. (See [Knu68, Vol. 3] and [Knu70].)

It may thus be possible to generate uniform random tables in $\Sigma_{rc}$ by generating pairs of
semi-standard Young tableaux in the uniform distribution. This still seems hard. A random walk on
pairs of tableaux generated by some Bernoulli-Laplace-like process is conceivable, and the coupling
techniques used in Chapter 2 might be of some utility.

C.1.5 Counting with Generating Functions

Let $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ be variables. Form the generating function

$$\prod_{i,j} (1 - x_i y_j)^{-1} = (1 + x_1 y_1 + (x_1 y_1)^2 + \cdots)(1 + x_1 y_2 + (x_1 y_2)^2 + \cdots) \cdots (1 + x_m y_n + (x_m y_n)^2 + \cdots). \tag{C.3}$$

One can verify that the coefficient of

$$x_1^{r_1} x_2^{r_2} \cdots x_m^{r_m}\, y_1^{c_1} y_2^{c_2} \cdots y_n^{c_n}$$

is exactly the number of $m \times n$ tables with row sums $r$ and column sums $c$.

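For example, the coefficient of $x_1^2 x_2 y_1 y_2 y_3$ is 3, which means that for $r = (2, 1)$ and $c = (1, 1, 1)$,
the value of $M_{rc}$ is 3. In fact, one can check that the three tables are

$$\begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$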

The  coefficients  in  the  generating  function  can  be  expressed  as  contour  integrals  involving  the 
generating  function  and  this  leads  to  some  asymptotic  approximations.  Good  [Goo76]  records  some 
efforts  in  this  direction. 

Along more algorithmic lines, we can compute initial portions of such generating functions by
multiplying polynomials. This does not give polynomial-time algorithms, but this method may be
feasible for small $m$, $n$, and $N$. We outline this method here. We will use $i$ to denote $(i_1, i_2, \ldots, i_k)$,
and write a term $z_1^{i_1} z_2^{i_2} \cdots z_k^{i_k}$ as $z^i$. We may then write a polynomial as $f(z_1, z_2, \ldots, z_k) = f(z) = \sum_i a_i z^i$.

Two $k$-variable polynomials of degree at most $d$ in each variable can be multiplied using fast
Fourier transforms on the group $\mathbb{Z}_{2d}^k$ in $O(k(2d)^k \lg d)$ arithmetic operations. To see this, suppose
$f(z) = \sum_i a_i z^i$ and $g(z) = \sum_j b_j z^j$ are two polynomials. Then their product is $h(z) = \sum_s c_s z^s$
where

$$c_s = \sum_{i + j = s} a_i b_j$$

is a convolution. This convolution can be computed by multiplying two Fourier transforms on the
group $\mathbb{Z}_{2d}^k$. Standard FFT algorithms on this group of order $(2d)^k$ can be used to compute the
transforms, inverse transform, and the pointwise multiplications in the stated $O((2d)^k \lg[(2d)^k]) =
O(k(2d)^k \lg d)$ arithmetic operations. (See for example [CLR90, Chap. 32].)

Let $d_* = \max(\{r_i \mid i \in [m]\} \cup \{c_j \mid j \in [n]\})$ be the maximum among all of the row and column
sums $r_i$ and $c_j$. Since degrees increase monotonically by addition when we multiply polynomials,
and the term whose coefficient is $M_{rc}$ in the generating function (C.3) has no variable of degree
exceeding $d_*$, we can eliminate from consideration any term with a variable whose order exceeds $d_*$.
Successively multiplying each of the $mn$ polynomials, and discarding terms of excess degree, gives
an initial segment of the generating function that is accurate in the coefficient we need. The value
of $M_{rc}$ can be computed in this way using $O(mn(m + n)(2d_*)^{m+n} \lg d_*)$ arithmetic operations on
$O((2d_*)^k)$ numbers, where $k = m + n$ is the number of variables.
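
The truncated multiplication can be sketched with sparse polynomials stored as dictionaries. This illustration departs from the text in two labelled ways: it uses direct term-by-term multiplication instead of FFTs, and it truncates each variable at its own target exponent ($r_i$ or $c_j$) rather than at the common bound $d_*$, which discards at least as many terms. The function name is hypothetical.

```python
from collections import defaultdict

def count_by_generating_function(r, c):
    """Coefficient of x^r y^c in prod_{i,j} (1 - x_i y_j)^(-1), by truncation."""
    m, n = len(r), len(c)
    if sum(r) != sum(c):
        return 0
    target = tuple(r) + tuple(c)
    # A polynomial maps an exponent vector (x-exponents, then y-exponents)
    # to its integer coefficient.
    poly = {(0,) * (m + n): 1}
    for i in range(m):
        for j in range(n):
            # Multiply by 1 + x_i y_j + (x_i y_j)^2 + ..., dropping any term
            # whose exponent of x_i or y_j would exceed the target.
            new_poly = defaultdict(int)
            for expo, coeff in poly.items():
                p = 0
                while expo[i] + p <= r[i] and expo[m + j] + p <= c[j]:
                    e = list(expo)
                    e[i] += p
                    e[m + j] += p
                    new_poly[tuple(e)] += coeff
                    p += 1
            poly = dict(new_poly)
    return poly.get(target, 0)
```

The number of retained terms is bounded by $\prod_i (r_i + 1) \prod_j (c_j + 1)$, so this remains exponential in $m + n$ in general.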

When this technique is feasible, it can also be applied to situations in which it is desired to count
the elements $T$ of $\Sigma_{rc}$ satisfying additional linear constraints of the form

$$\sum_{i,j} a_{ij} T_{ij} = \alpha.$$

The basic generating function for tables can then be augmented to

$$\prod_{i,j} (1 - s^{a_{ij}} x_i y_j)^{-1},$$

so that the coefficient of $s^{\alpha} x^r y^c$ is the number of tables in $\Sigma_{rc}$ satisfying the additional constraint.
Additional variables can be used to add more constraints.

For example, the generating function for “$n \times n$ magic squares with diagonal constraints” can be
expressed with two additional variables $s$ and $t$ as

$$\prod_{i,j} (1 - s^{a_{ij}} t^{b_{ij}} x_i y_j)^{-1},$$

where $a_{ij} = 1$ precisely if $i = j$, $b_{ij} = 1$ precisely when $i + j = n + 1$, and both are zero otherwise.

Such additional constraints arise naturally in various types of contingency table inference. For
example, Agresti, Mehta, and Patel [AMP90] needed the number of tables with prescribed row and
column sums and one additional global constraint on the sum $\sum_{i,j} u_i v_j T_{ij}$, for prescribed values of $u_i$ and $v_j$.

Remark: A different approach involving generating functions is discussed by Stanley [Sta86,
Chapter 4], who suggests an approach to counting the integer lattice points within a convex polytope
with rational vertices. Determining $M_{rc}$ may be viewed as such a counting problem. (See our
Sections 5.3 and 5.5.) In particular, Stanley uses his method to prove some results about the
number of magic squares. (See our Examples 5.21 and C.1.) This approach may yield similar results
for more general instances of $\Sigma_{rc}$. □

C.2  Hardness  of  a  Related  Problem 

We have not been able to determine the complexity of the counting problem for $\Sigma_{rc}$. However, in
this section we show that a related problem is provably difficult.

Let $\Sigma$ be a finite alphabet. If $x \in \Sigma^*$, we use $|x|$ to denote the length of $x$. If $L \subseteq \Sigma^*$, and $w$ is
a given word in $\Sigma^*$, define the extension language

$$(L|w) = \{x \mid wx \in L\}.$$

In other words, $(L|w)$ is the set of extensions $x$ of $w$ that result in a word $wx$ in $L$.

Let $R \subseteq \Sigma^* \times \Sigma^*$ be a relation on words. We write $R(x, y)$ to denote $(x, y) \in R$, and $R(x)$ to
denote the set $\{y \mid R(x,y)\}$. $R$ is a p-relation if there are polynomials $p$ and $q$ such that

• the predicate $R(x, y)$ can be checked in deterministic time $p(|x|)$.

• if $y \in R(x)$ then $|y| \le q(|x|)$.

We call $R(x)$ the solution set of $R$ given $x$.
The class NP can be defined as the class of sets that can be represented as $\{x \mid R(x) \ne \emptyset\}$ for some
p-relation $R$. If $R$ is a p-relation, the question “Given $x$, is there a $y$ such that $R(x,y)$?” is called
the decision problem for the relation. (For further background see [GJ79].)




Similarly, one may pose the counting problem for a p-relation $R$ by asking “Given $x$, how
many $y$ are there such that $R(x, y)$?” The class #P is the class of counting problems for p-relations,
or, more formally, the class of nonnegative integer functions $f$ over $\Sigma^*$ such that $f(x) = |R(x)|$ for
some p-relation $R$.

There are elements $f$ in #P such that the existence of a polynomial time algorithm that computes
$f$ would imply the existence of polynomial time algorithms for each function in #P. Such a counting
problem $f$ is called #P-complete; these problems are in a sense the hardest in #P. We now show
that the counting problem for a certain family of sets related to $\Sigma_{rc}$ is #P-complete.

For a set of matrix coordinates $Z \subseteq [m] \times [n]$, the set $\Sigma_{rc}^Z$ of contingency tables with
structural zeros at $Z$ is the subset of $\Sigma_{rc}$ in which every table has only zeros at the entries $(i,j)$, for
$(i,j) \in Z$.

Theorem C.2 With $r$, $c$, $Z$ as parameters, the counting problem for $\Sigma_{rc}^Z$ is #P-complete. This
holds even if the inputs are expressed in unary.

Proof: First note that the counting function for the set $\Sigma_{rc}^Z$ is clearly in #P. To prove completeness,
we give a reduction from the problem of computing the permanent. Given an $n \times n$ matrix $A$ with
0-1 entries, the permanent of $A$ is the number of perfect matchings in the bipartite graph with
vertex sets $[n]$ and $[n]$ and adjacency matrix $A$. The problem of computing the permanent was
shown to be #P-complete by Valiant [Val79].

Computing the permanent is just a special case of computing $|\Sigma_{rc}^Z|$. Suppose we are given the
$n \times n$ adjacency matrix $A$ for a bipartite graph $G$. Let $r = c = (1, 1, \ldots, 1)$ (of dimension $n$).
Construct the set $Z$ of pairs $(i,j)$ for which $A_{ij} = 0$. Now it is easy to see that a table $T$ is in $\Sigma_{rc}^Z$ if
and only if $T$ is the adjacency matrix of a perfect matching in $G$. Thus $|\Sigma_{rc}^Z|$ is equal to the number
of perfect matchings in $G$. The described reduction can clearly be done in log-space. This proves
the theorem. □
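
The reduction can be checked on a small instance by brute force. In this sketch both functions are exponential-time illustrations written for this note (the names are hypothetical), and cells are indexed from 0 rather than from 1.

```python
from itertools import permutations

def permanent_01(A):
    """Permanent of a square 0-1 matrix, by summing over all permutations."""
    n = len(A)
    return sum(all(A[i][s[i]] for i in range(n))
               for s in permutations(range(n)))

def count_tables_with_zeros(r, c, Z):
    """Count tables with row sums r, column sums c, and zeros forced on Z."""
    m, n = len(r), len(c)
    if sum(r) != sum(c):
        return 0
    col_left = list(c)                  # column capacity still unused

    def fill(i, j, row_left):
        if i == m:
            return 1                    # every row and column sum is met
        if j == n:
            if row_left != 0:
                return 0
            return fill(i + 1, 0, r[i + 1] if i + 1 < m else 0)
        hi = 0 if (i, j) in Z else min(row_left, col_left[j])
        total = 0
        for v in range(hi + 1):
            col_left[j] -= v
            total += fill(i, j + 1, row_left - v)
            col_left[j] += v
        return total

    return fill(0, 0, r[0])

# Hypothetical bipartite adjacency matrix with two perfect matchings.
A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
Z = {(i, j) for i in range(3) for j in range(3) if A[i][j] == 0}
ones = (1, 1, 1)
num_tables = count_tables_with_zeros(ones, ones, Z)
```

With all margins equal to 1, each admissible table is a permutation matrix supported on the nonzero cells of $A$, so the two counts agree.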

We conjecture that the counting problem for $\Sigma_{rc}$ is also #P-complete, even with parameters in
unary. That is, we think it is unlikely that there is an algorithm that exactly counts $\Sigma_{rc}$ in time
polynomial in $m$, $n$, and $N$.

Remark: The set $\Sigma_{rc}^Z$, like the set $\Sigma_{rc}$, arises naturally in the analysis of contingency tables
where the row and column classifications that gave rise to the table preclude certain combinations.
For example, suppose that from a certain study we construct a table that classifies subjects by
sex along the rows, and by the incidence of various cancers along the columns. Then a table entry
representing the number of males with uterine cancer would necessarily contain a zero. (Statisticians
call such a zero structural.) □




C.3  Approximate  Counting  using  Sampling 

The techniques of Chapter 5 can be applied to do approximate counting of $\Sigma_{rc}$. In an earlier remark
(on page 104), we noted that the techniques of Dyer, Frieze, and Kannan [DFK89] can be applied to
approximately count $\Sigma_{rc}$. In this section, we show how our own sampling methods from Chapter 5
provide a means to do approximate counting, although we provide no polynomial-time guarantees.

The  problems  of  counting  a  set  and  uniformly  sampling  from  that  set  are  closely  related.  This 
relation has been made precise in a complexity theoretic sense by Jerrum, Valiant, and Vazirani
in  [JVV86].  In  order  to  keep  this  work  essentially  self-contained,  we  will  present  a  brief  synopsis  of 
their  main  idea  here.  However,  to  get  a  complete  understanding  of  what  we  say  here,  the  reader 
should  be  familiar  with  their  ideas  already  or  study  their  paper  concurrently. 

To clarify the setting, first consider the following simplified situation. Suppose one has a set $V$
and subset $H \subset V$, where $|H|$ is known, but $|V|$ is not. If we could get a good approximation to
$p = |H|/|V|$, then we could approximate $|V| = |H|/p$. This $p$ is the mean value of the indicator
function of the subset $H$ under the uniform distribution on $V$. So we could approximate $p$ by
sampling (we presented an efficient means of doing this in Chapter 3), and provided $p$ is not too
small, the number of samples required to get a reasonable estimate will not be excessive.
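
The simplified situation can be illustrated with a small sketch. It assumes independent uniform samples from $V$ are available, which is a simplification: the methods of Chapter 3 work with samples from a Markov chain instead. The function name, the toy set $V = \{0, \ldots, 999\}$, and the sample size are all illustrative choices, not part of the original text.

```python
import random

def estimate_size(sample_uniform, in_subset, subset_size, num_samples, rng):
    """Estimate |V| as |H| / p_hat, where p_hat is the sampled fraction landing in H.

    Assumption: sample_uniform(rng) returns an independent uniform draw from V.
    """
    hits = sum(1 for _ in range(num_samples) if in_subset(sample_uniform(rng)))
    if hits == 0:
        raise ValueError("no sample fell in H; p is too small for this sample size")
    p_hat = hits / num_samples
    return subset_size / p_hat

# Toy instance: V = {0, ..., 999}, H = multiples of 4, so |H| = 250 and p = 1/4.
rng = random.Random(0)
estimate = estimate_size(lambda g: g.randrange(1000),
                         lambda v: v % 4 == 0,
                         250, 20000, rng)
print(round(estimate))
```

The relative error of the estimate grows as $p$ shrinks, which is why the text asks that $p$ not be too small.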

Similarly, suppose we can find in $V$ a sequence of nested subsets $V = H_1 \supset H_2 \supset \cdots \supset H_m$
where $|H_m| = 1$ or some other known value, where each ratio $p_i = |H_{i+1}|/|H_i|$ is not too small.
Then using the fact that $|V| = |H_m| \prod_{i=1}^{m-1} (1/p_i)$ we can approximate $|V|$ using approximations for the
$p_i$. We will work with sets $V$ that admit such a decomposition, where, in fact, each subset $H_i$ in the
nested sequence is just a ‘smaller instance’ of $V$.
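
A minimal sketch of the nested-subset estimator follows. It multiplies the known size of the last set by the reciprocals of the sampled ratios. The toy family (bit strings whose leading bits are forced to zero) and all names are hypothetical, and the samplers are assumed to be exactly uniform, whereas the sampling considered in this appendix is only near-uniform.

```python
import random

def estimate_by_nested_ratios(samplers, memberships, last_size, num_samples, rng):
    """Estimate |H_1| as |H_m| times the product of the estimated 1/p_i.

    samplers[i](rng) draws uniformly from the i-th set of the chain (0-based);
    memberships[i](s) tests whether s also lies in the next, smaller set.
    """
    estimate = float(last_size)
    for sample, is_in_next in zip(samplers, memberships):
        hits = sum(1 for _ in range(num_samples) if is_in_next(sample(rng)))
        if hits == 0:
            raise ValueError("a ratio was too small to estimate with this sample size")
        estimate *= num_samples / hits  # multiply by the estimate of 1/p_i
    return estimate

N_BITS = 10  # toy instance: V is the set of all bit strings of length 10

def make_sampler(i):
    # Uniform draw from the strings whose first i bits are zero.
    return lambda g: tuple([0] * i + [g.randrange(2) for _ in range(N_BITS - i)])

def make_membership(i):
    # Membership in the next set of the chain: the first i + 1 bits are zero.
    return lambda s: all(b == 0 for b in s[:i + 1])

samplers = [make_sampler(i) for i in range(N_BITS)]
memberships = [make_membership(i) for i in range(N_BITS)]
estimate = estimate_by_nested_ratios(samplers, memberships, 1, 4000, random.Random(1))
print(round(estimate))  # the true value is 2**10 = 1024
```

Every ratio in this toy chain equals 1/2, so each factor is easy to estimate. The errors of the individual factors compound in the product, so the number of samples per ratio has to grow with the length of the chain.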

We now define a class of sets that admit this kind of decomposition in a way that can be computed
‘efficiently.’ Let $\Sigma$ be a finite alphabet. A relation $R \subseteq \Sigma^* \times \Sigma^*$ is (polynomially) self-reducible
if

• There is a deterministic polynomial-time computable function $g : \Sigma^* \to \mathbb{N}$ such that if $R(x, y)$
then $|y| = g(x)$.

• There exist polynomial-time computable functions $\psi : \Sigma^* \times \Sigma^* \to \Sigma^*$ and $\sigma : \Sigma^* \to \mathbb{N}$ such
that for all $x, w \in \Sigma^*$,

$\sigma(x) \le c \lg |x|$, for some constant $c$,

$g(x) > 0$ implies $\sigma(x) > 0$,

and $|\psi(x, w)| \le |x|$,

and such that for all $x \in \Sigma^*$, if $y = wz$ with $|y| = g(x)$ and $|w| = \sigma(x)$ then

$R(x, wz)$ if and only if $R(\psi(x, w), z)$.




This tells us that $R(x, wz)$ can be determined by first computing $\psi(x, w)$ and then determining
$R(\psi(x, w), z)$. We can also write this last condition as:

$(R(x) \mid w) = R(\psi(x, w))$.

That is, each extension language $(R(x) \mid w)$ can be expressed in terms of the original relation $R$
on some smaller instance $\psi(x, w)$. Since $|w| = \sigma(x) = O(\ln |x|)$ the entire solution set $R(x)$ can be
expressed as the disjoint union of a polynomial number of solution sets of the same relation on smaller
instances. Many interesting sets can be expressed as the solution sets of self-reducible p-relations.
(For the definition of a p-relation, see Section C.2.) We give one example below. Many others exist,
including, for example, the perfect matchings in a bipartite graph, satisfying assignments of CNF
or DNF formulae, and independent sets of given size in a graph.

Example C.3 (Spanning Trees) The relation $R(x, y)$ = “$y$ is a spanning tree of the graph $x$”
is a self-reducible p-relation. Let $x$ represent a graph with $n$ nodes (we will consider $x$ to be an
adjacency list, where each node index is an $\ell = \lceil \lg n \rceil$ bit integer). We will represent spanning trees
as lists of $n - 1$ edges, each edge being a pair of node indices. So every solution $y$ has length
$|y| = g(x) = 2\ell(n - 1)$. Let $\sigma(x) = 2\ell$, so that the first $\sigma(x)$ characters of a solution $y$ will represent
the first edge in the list. For a string $y$ of length $g(x)$, write $y = wz$, where $|w| = \sigma(x)$ and $w$
represents one edge. Let $\psi(x, w)$ represent the result of contracting the edge $w$ in $x$ (merging the
vertices at the two ends of $w$ and erasing any resulting multiple edges). This yields a smaller graph, with
$|\psi(x, w)| \le 2\ell(n - 2) < |x|$. Now note that $R(x, wz)$ if and only if $R(\psi(x, w), z)$. That is, $y = wz$ is
a spanning tree of $x$ if and only if, when we ‘contract’ the edge $w$ in $x$, $z$ represents a spanning tree
of the resulting $\psi(x, w)$. We can also write $(R(x) \mid w) = R(\psi(x, w))$. □
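
The contraction map of the example can be sketched as follows. The representation is an assumption made for illustration: nodes are numbered 0 to n - 1 and the graph is a list of undirected edges rather than the bit-string adjacency list of the example. The function name is hypothetical. Following the wording of the example, loops and parallel edges created by the merge are erased.

```python
def contract_edge(num_nodes, edges, edge):
    """Contract `edge` in a simple graph on nodes 0..num_nodes-1.

    Returns (num_nodes - 1, new_edges), with the merged node keeping the
    smaller label and higher labels shifted down by one. Loops and parallel
    edges produced by the merge are dropped, as in the example's description.
    """
    u, v = min(edge), max(edge)
    if u == v:
        raise ValueError("cannot contract a loop")

    def relabel(a):
        if a == v:
            return u                    # merge v into u
        return a - 1 if a > v else a    # close the gap left by v

    new_edges = set()
    for a, b in edges:
        a2, b2 = relabel(a), relabel(b)
        if a2 != b2:                    # drop loops created by the merge
            new_edges.add((min(a2, b2), max(a2, b2)))
    return num_nodes - 1, sorted(new_edges)

# Contracting one edge of a triangle leaves a single edge on two nodes.
print(contract_edge(3, [(0, 1), (1, 2), (0, 2)], (0, 1)))
```

The contracted graph has one node fewer, so repeated contraction reaches a one-node graph after n - 1 steps, matching the length n - 1 of the edge list that encodes a spanning tree.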

Using  essentially  the  same  reasoning  as  we  outlined  in  the  early  paragraphs  of  this  section, 
Jerrum, Valiant, and Vazirani show that if we can efficiently do near-uniform sampling in these sets
then  we  can  efficiently  count  these  sets  to  within  small  relative  error.  In  particular  they  give  the 
following  theorem. 

Theorem C.4 (Counting using Sampling) (Adapted from [JVV86]) Let $R$ be a self-reducible
p-relation that is also in P. Suppose that we have an algorithm that

• takes input $x \in \Sigma^*$, and $\epsilon$, $0 < \epsilon < 1$,

• runs in time polynomial in $|x|$ and $\ln(1/\epsilon)$,

• and generates a random sample $y \in R(x)$ whose distribution is within ratio $1 + \epsilon$ of the uniform
distribution.

Then, given an $x$, we can compute a count $C$ such that $C$ approximates $|R(x)|$ within ratio $1 + \epsilon$
with probability at least $1 - \delta$. Moreover, this computation can be done in time polynomial in $|x|$,
$1/\epsilon$ and $\ln(1/\delta)$.




Remark: The requirement that the relation $R$ is in P can be dropped if a slightly different notion of
near-uniform sampling is used or if we are only concerned with inputs $x$ for which $R(x)$ is nonempty.
See [JVV86]. □

We will now show that we can cast $\Sigma_{rc}$ in a self-reducible form. For any table $F \in \Sigma_{rc}$ and
ordered pair $(k, l) \in [m] \times [n]$, let $[\Sigma_{rc} \mid F; (k, l)]$ denote the set of all tables $T$ in $\Sigma_{rc}$ with

$T_{ij} = F_{ij}$ whenever $i < k$ and whenever $i = k$ and $j < l$.

In other words, $[\Sigma_{rc} \mid F; (k, l)]$ is the subset of $\Sigma_{rc}$ tables whose entries match the table $F$ in all
positions strictly preceding $(k, l)$ in the lexicographic order. (We will use the symbols $\succ$ (and $\succeq$)
to denote this order relation.) Notice that we have the following properties for any $F \in \Sigma_{rc}$: (a)
$F \in [\Sigma_{rc} \mid F; (k, l)]$, (b) $[\Sigma_{rc} \mid F; (1, 1)] = \Sigma_{rc}$, and (c) for any $(k, l) \succ (m - 1, n - 1)$ we have
$[\Sigma_{rc} \mid F; (k, l)] = \{F\}$, since all remaining entries are then determined by the sum constraints.

If we use $F_{kl \leftarrow t}$ to denote the table obtained from $F$ by setting $F_{kl} = t$, then we may write

$[\Sigma_{rc} \mid F; (k, l)] = \bigcup_{0 \le t \le N} [\Sigma_{rc} \mid F_{kl \leftarrow t}; \mathrm{succ}(k, l)]$,

where $\mathrm{succ}(k, l)$ is the ordered pair that is the immediate successor of $(k, l)$ in the lexicographic order.
Note that some of the sets on the right may be empty. This relation expresses the decomposition
needed to show that $[\Sigma_{rc} \mid F; (k, l)]$ is polynomially self-reducible in the parameters $m$, $n$, and $N$.
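
The decomposition can be checked on small instances by exact enumeration. The sketch below is illustrative only: it fixes entries one at a time in lexicographic order and sums the counts of the resulting sub-families. Its running time is exponential, so it is no substitute for the sampling approach. The function name is hypothetical, indices are 0-based, and the loop over `t` skips values that would exceed a residual row or column sum, which correspond to empty sets in the union.

```python
def count_tables(rows, cols):
    """Count non-negative integer tables with row sums `rows` and column sums `cols`."""
    m, n = len(rows), len(cols)
    if sum(rows) != sum(cols):
        return 0

    def count_from(k, l, row_left, col_left):
        # row_left and col_left hold the sums still to be placed once every
        # entry preceding position (k, l) has been fixed.
        if k == m:
            return 1 if all(v == 0 for v in col_left) else 0
        if l == n:
            # Row k is complete: it must have used up its row sum exactly.
            return count_from(k + 1, 0, row_left, col_left) if row_left[k] == 0 else 0
        total = 0
        for t in range(min(row_left[k], col_left[l]) + 1):
            row_left[k] -= t
            col_left[l] -= t
            total += count_from(k, l + 1, row_left, col_left)  # move to succ(k, l)
            row_left[k] += t
            col_left[l] += t
        return total

    return count_from(0, 0, list(rows), list(cols))

print(count_tables([2, 2], [2, 2]))        # tables [[a, 2-a], [2-a, a]] for a = 0, 1, 2
print(count_tables([1, 1, 1], [1, 1, 1]))  # the 3 x 3 permutation matrices
```

With all row and column sums equal to 1 the count is $n!$, the number of permutation matrices, which agrees with the reduction from the permanent given earlier in this appendix.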

Now we will show that we can use essentially the same random walk to draw samples from
$[\Sigma_{rc} \mid F; (k, l)]$ as we used to draw samples from $\Sigma_{rc}$.

Algorithm C.5 (Random walk on $[\Sigma_{rc} \mid F; (k, l)]$) For a given $k$ and $l$, modify Algorithm 5.5 so
that in steps 2 and 3, it chooses only amongst values of $i_1$, $i_2$, $j_1$ and $j_2$ such that $(i_1, j_1) \succeq (k, l)$.
That is, replace those steps with the following.

2', 3' Uniformly choose a pair of rows $i_1$ and $i_2$, $1 \le i_1 < i_2 \le m$, and a pair of columns $j_1$ and $j_2$,
$1 \le j_1 < j_2 \le n$, such that $(i_1, j_1) \succeq (k, l)$.
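
One step of the restricted walk might look like the sketch below. Algorithm 5.5 is not reproduced in this appendix, so the move is an assumption: add +1 and -1 in a checkerboard pattern on the chosen 2 x 2 minor, with a random sign, and stay put if an entry would become negative. Indices are 0-based and the function name is hypothetical.

```python
import random
from itertools import combinations

def restricted_step(table, k, l, rng):
    """One step of the walk that never alters positions preceding (k, l).

    Assumption (Algorithm 5.5 is not shown here): the basic move adds
    +1/-1 in a checkerboard pattern on a 2 x 2 minor and is rejected if
    any entry would become negative.
    """
    m, n = len(table), len(table[0])
    # Python compares tuples lexicographically, matching the order in the text.
    choices = [(i1, i2, j1, j2)
               for i1, i2 in combinations(range(m), 2)
               for j1, j2 in combinations(range(n), 2)
               if (i1, j1) >= (k, l)]
    if not choices:
        return table
    i1, i2, j1, j2 = rng.choice(choices)
    sign = rng.choice((1, -1))
    new = [row[:] for row in table]
    new[i1][j1] += sign
    new[i2][j2] += sign
    new[i1][j2] -= sign
    new[i2][j1] -= sign
    if min(new[i1][j1], new[i1][j2], new[i2][j1], new[i2][j2]) < 0:
        return table  # rejected move: the walk holds in place
    return new

rng = random.Random(7)
table = [[1, 2, 0], [0, 1, 3], [2, 0, 1]]
for _ in range(300):
    table = restricted_step(table, 1, 0, rng)
print(table)  # the first row is still [1, 2, 0]
```

Since $(i_1, j_1)$ is the lexicographically smallest of the four altered positions, constraining it alone is enough to protect every position preceding $(k, l)$.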

Theorem C.6 The random walk generated by Algorithm C.5, and started on any $F \in \Sigma_{rc}$, is
ergodic and has uniform stationary distribution on $[\Sigma_{rc} \mid F; (k, l)]$.

Proof: The walk always remains within $[\Sigma_{rc} \mid F; (k, l)]$ since no entries in positions preceding $(k, l)$
are ever altered. That the walk is irreducible on $[\Sigma_{rc} \mid F; (k, l)]$ follows from the second clause
of Lemma 5.3, which says, in effect, that there is a path between any two tables $A$ and $B$ in
$[\Sigma_{rc} \mid F; (k, l)]$ that does not involve coordinates lexicographically less than $(k, l)$. The arguments for
symmetry and aperiodicity are the same as for the basic walk. ■

As before, we don't know that the walk converges in polynomial time. However, under the
supposition that it does, we have the next theorem.




Theorem C.7 Suppose there exists a convergence guarantee for the random walk of Algorithm C.5
that is polynomial in $m$, $n$, $N$, and $1/\epsilon$. Then there is an algorithm that

1. takes inputs $r$, $c$, $\epsilon$ and $\delta$,

2. runs in time polynomial in $m$, $n$, $N$, $1/\epsilon$, and $\ln(1/\delta)$,

3. produces an approximate count $C$ that is within ratio $\epsilon$ of $\mathcal{N}_{rc}$ with probability at least $1 - \delta$.

Proof: Since this set is polynomially self-reducible in the parameters $m$, $n$, and $N$, Theorem C.4
applies. Theorem C.6 combined with the hypothesized convergence guarantee provides the necessary
means of near-uniform sampling from $[\Sigma_{rc} \mid F; (k, l)]$. To estimate $\mathcal{N}_{rc}$, we estimate the cardinality
of $\Sigma_{rc} = [\Sigma_{rc} \mid F; (1, 1)]$, where $F$ is any table in $\Sigma_{rc}$. ■

Remark: Traditional methods of uniform sampling from a combinatorial set $V$ rely on formulas for
$|V|$, or implicitly suggest formulas for $|V|$. But in this setting $|V|$ is exactly what we don't know and
wish to know. Broder [Bro86] made the valuable observation that sampling using a Markov chain
avoids this problem. A number of other authors (e.g. [JS88, DFK89]) have since used Markov chains
to do sampling for approximate counting. Note that the results of Chapter 3 on approximating
mean values of indicator functions can be applied directly to improve the running times in all of
these approximate counting algorithms as well. □


Bibliography 


[AD86] D. Aldous and P. Diaconis. Strong uniform times and finite random walks. Technical
Report 249, Department of Statistics, Stanford University, 1986.

[AGM87] N. Alon, Z. Galil, and V. D. Milman. Better expanders and superconcentrators. Journal
of Algorithms, 8:337-347, 1987.

[AKL+79] R. Aleliunas, R. M. Karp, R. J. Lipton, L. Lovász, and C. Rackoff. Random walks,
universal traversal sequences, and the complexity of maze problems. In Proceedings of
the 20th IEEE Symposium on the Foundations of Computer Science (FOCS), 1979.

[AKS87] M. Ajtai, J. Komlós, and E. Szemerédi. Deterministic simulation of logspace. In
Proceedings of the 19th ACM Symposium on the Theory of Computing (STOC), 1987.

[Ald87] D. Aldous. On the Markov chain simulation method for uniform combinatorial
distributions and simulated annealing. Probability in the Engineering and Informational
Sciences, 1:33-46, 1987.

[Alo86] N. Alon. Eigenvalues and expanders. Combinatorica, 6(2):83-96, 1986.

[AM85] N. Alon and V. D. Milman. $\lambda_1$, isoperimetric inequalities for graphs, and
superconcentrators. Journal of Combinatorial Theory, Series B, 38:73-88, 1985.

[AMP90] A. Agresti, C. Mehta, and N. Patel. Exact inference for contingency tables with ordered
categories. Journal of the American Statistical Association, 85:453-458, 1990.

[Bab79] L. Babai. Spectra of Cayley graphs. Journal of Combinatorial Theory, Series B,
27:180-189, 1979.

[Bab90] L. Babai. Local expansion of vertex-transitive graphs and random generation in finite
groups. Technical report, Department of Computer Science, University of Chicago, 1990.

[Bac87] E. Bach. Realistic analysis of some randomized algorithms. In Proceedings of the 19th
ACM Symposium on the Theory of Computing (STOC), 1987.




[Ban80] C. Bandle. Isoperimetric Inequalities and Applications. Pitman Advanced Publishing
Program, 1980.

[BD89] D. Bayer and P. Diaconis. Trailing the dovetail shuffle to its lair. Technical Report 329,
Department of Statistics, Stanford University, 1989.

[BFH75] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis:
Theory and Practice. MIT Press, Cambridge Mass, 1975.

[BGG90] M. Bellare, O. Goldreich, and S. Goldwasser. Randomness in interactive proofs. In
Proceedings of the 31st Annual IEEE Symposium on the Foundations of Computer Science
(FOCS), pages 563-572, 1990.

[Big74] N. L. Biggs. Algebraic Graph Theory. Cambridge Tracts in Mathematics, No. 67.
Cambridge University Press, 1974. Unfortunately, this wonderful book is no longer in print.

[BK88] A. Z. Broder and A. Karlin. Bounds on covering times. In Proceedings of the 29th IEEE
Symposium on the Foundations of Computer Science (FOCS), 1988.

[BKL89] L. Babai, W. M. Kantor, and A. Lubotzky. Small-diameter Cayley graphs for finite simple
groups. European Journal of Combinatorics, 10:507-522, 1989.

[BL84] G. Birkhoff and R. E. Lynch. Numerical Solution of Elliptic Problems. Number 6 in
SIAM Studies in Applied Mathematics. SIAM, Philadelphia, 1984.

[Bol86] B. Bollobás. Combinatorics: Set Systems, Hypergraphs, Families of Vectors, and
Combinatorial Probability. Cambridge University Press, 1986.

[BOP88] J. Baglivo, D. Olivier, and M. Pagano. Methods for the analysis of contingency
tables with large and small cell counts. Journal of the American Statistical Association,
83(404):1006-1013, 1988.

[Bro] A. Z. Broder. Couplings and strong-uniform times for the hypercube. Oral
communication.

[Bro86] A. Z. Broder. How hard is it to marry at random? (on the approximation of the
permanent). In Proceedings of the 18th ACM Symposium on the Theory of Computing,
1986.

[BW73] D. M. Boulton and C. S. Wallace. Occupancy of a rectangular array. Computing,
16(1):57-63, 1973.

[Cal90] J. M. Calvin. Stochastic Optimization Algorithms and Moment Formulas for Markov
Chains. PhD thesis, Stanford University, 1990.




[CDS80] D. M. Cvetković, M. Doob, and H. Sachs. Spectra of Graphs: Theory and Applications.
Academic Press, New York, 1980.

[Che70] J. Cheeger. A lower bound for the smallest eigenvalue of the Laplacian. In Problems in
Analysis. Princeton University Press, New Jersey, 1970.

[Chu60] K. L. Chung. Markov Chains with Stationary Transition Probabilities. Springer-Verlag,
Berlin, 1960.

[CLR90] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Electrical
Engineering and Computer Science Series. McGraw-Hill/The MIT Press, 1990.

[CR81] C. W. Curtis and I. Reiner. Methods of Representation Theory, with applications to
finite groups and orders. Wiley and Sons, New York, 1981. Two volumes.

[CW89] A. Cohen and A. Wigderson. Dispersers, deterministic amplification, and weak random
sources. In Proceedings of the 30th IEEE Symposium on the Foundations of Computer
Science (FOCS), 1989.

[DE85] P. Diaconis and B. Efron. Testing for independence in a two-way table: New
interpretations of the chi-square statistic. The Annals of Statistics, 13(3):845-874, 1985.
Invited paper. Discusses the case for volume tests and gives asymptotic approximations
for estimating volume-based significance of the Chi-square statistic; some opposing and
supporting discussion ensues in pages following, and a rejoinder by the authors appears
on page 905 of the same volume.

[DF88] P. Diaconis and J. A. Fill. Strong stationary times via a new form of duality. Technical
Report 305, Department of Statistics, Stanford University, 1988.

[DFK89] M. Dyer, A. Frieze, and R. Kannan. A random polynomial time algorithm for
approximating the volume of convex bodies. In Proceedings of the 21st Annual ACM Symposium
on the Theory of Computing (STOC), 1989.

[Dia88] P. Diaconis. Group Representations in Probability and Statistics, volume 11 of Lecture
Notes - Monograph Series. Institute of Mathematical Statistics, 1988.

[Dia89] P. Diaconis. A generalization of spectral analysis with application to ranked data. The
Annals of Statistics, 17(3):949-979, 1989. Content of the 1987 Wald Memorial Lectures.

[DS81] P. Diaconis and M. Shahshahani. Generating a random permutation with random
transpositions. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 57:159-179, 1981.

[DS87] P. Diaconis and M. Shahshahani. Time to reach stationarity in the Bernoulli-Laplace
diffusion model. SIAM Journal on Mathematical Analysis, 18(1):208-218, January 1987.




[DS89] P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of Markov chains.
Technical Report 325, Department of Statistics, Stanford University, 1989.

[EE07] P. Ehrenfest and T. Ehrenfest. Über zwei bekannte Einwände gegen das Boltzmannsche
H-Theorem. Phys. Zeitschrift, 8:311-314, 1907. Also see Feller, Volume 1, p. 121 and
pp. 377ff.

[ER61] P. Erdős and A. Rényi. On a classical problem of probability theory. Magyar Tud. Akad.
Matemat. Kutató Intézet. Közl., 6:215-219, 1961. Proves a limit law for the multiple
coupon collecting problem when the desired number of complete sets of coupons is fixed.

[Eve77] B. S. Everitt. The Analysis of Contingency Tables. Monographs on Applied Probability
and Statistics. Chapman and Hall, London, 1977.

[Fel70] W. Feller. An Introduction to Probability Theory and its Applications. Wiley and Sons,
1970. Two volumes: 3rd (Vol. 1) and 2nd (Vol. 2) revised editions. The first edition was
printed in 1968.

[Fie73] M. Fiedler. Algebraic connectivity of graphs. Czechoslovakian Mathematics Journal,
23:298-305, 1973.

[Fil90] J. A. Fill. Eigenvalue bounds on convergence to stationarity for non-reversible Markov
chains, with an application to the exclusion process. Technical report, The Johns Hopkins
University, Department of Mathematical Sciences, 1990.

[Fla82] L. Flatto. Limit theorems for some random variables associated with urn models. The
Annals of Probability, 10(4):927-934, 1982. Asymptotic analysis of variables associated
with multiple coupon collecting.

[Flo] R. W. Floyd. Generating random samples without replacement, as sequences and sets.
This unpublished note, dated April 1987, contains Floyd’s elegant random subset
algorithm, and another for generating a random permutation of m elements out of [n]. Their
only published appearance to date is in Jon Bentley’s column “Programming Pearls,”
Communications of the ACM, 30(9):754-757, September 1987.

[FOW85] L. Flatto, A. M. Odlyzko, and D. B. Wales. Random shuffles and group representations.
The Annals of Probability, 13(1):154-178, 1985.

[GC77] I. J. Good and J. F. Crook. The enumeration of arrays and a generalization related to
contingency tables. Discrete Mathematics, 19:23-65, 1977.

[GG81] O. Gabber and Z. Galil. Explicit constructions of linear-sized superconcentrators.
Journal of Computer and System Sciences, 22:407-420, 1981.




[GI84] P. W. Glynn and D. L. Iglehart. Recursive moment formulas for regenerative
simulation. In J. Janssen, editor, Proceedings of the International Symposium on Semi-Markov
Processes and their Applications, Brussels, 1984.

[GI87] P. W. Glynn and D. L. Iglehart. A joint central limit theorem for the sample mean and
regenerative variance estimator. Annals of Operations Research, 8, 1987.

[GJ79] M. Garey and D. Johnson. Computers and Intractability. Freeman, 1979.

[GL89] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University
Press, 1989. Second edition.

[GLS88] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial
Optimization. Springer-Verlag, 1988.

[GM77] M. Gail and N. Mantel. Counting the number of r x c contingency tables with fixed
margins. Journal of the American Statistical Association, 72(360), 1977.

[Goo76] I. J. Good. On the application of symmetric Dirichlet distributions and their mixtures
to contingency tables. Annals of Statistics, 4:1159-1189, 1976.

[Gri75] D. Griffeath. A maximal coupling for Markov chains. Z. Wahrscheinlichkeitstheorie
verw. Gebiete, 31:95-106, 1975.

[Gri78] D. Griffeath. Coupling methods for Markov processes. In Advances in Mathematics
Supplementary Studies: Studies in Probability and Ergodic Theory 2, pages 1-43. Academic
Press, 1978.

[Hal82] Peter Hall. Rates of Convergence in the Central Limit Theorem. Number 62 in Research
Notes in Mathematics. Pitman Publishing, Marshfield, Mass., 1982.

[Han57] E. J. Hannan. The variance of the mean of a stationary process. Royal Statistical Society
Journal, Series B, 19(2):282-5, 1957.

[Han74] T. W. Hancock. Remark on algorithm 434. Communications of the ACM, 18:117-119,
1974.

[Hil90] M. Hildebrand. Rates of Convergence of Some Random Processes on Finite Groups.
PhD thesis, Department of Mathematics, Harvard University, 1990.

[Hun74] T. W. Hungerford. Algebra. Number 73 in Graduate Texts in Mathematics.
Springer-Verlag, New York, 1974.




[Igl78] D. L. Iglehart. The regenerative method for simulation analysis. In K. M. Chandy and
R. T. Yeh, editors, Current Trends in Programming Methodology, volume III: Software
Engineering. Prentice-Hall, New Jersey, 1978.

[ILL89] R. Impagliazzo, L. Levin, and M. Luby. Pseudo-random generation from one-way
functions. In Proceedings of the 21st ACM Symposium on the Theory of Computing (STOC),
pages 12-24, 1989.

[IZ89] R. Impagliazzo and D. Zuckerman. How to recycle random bits. In Proceedings of the
30th IEEE Symposium on the Foundations of Computer Science (FOCS), 1989.

[JK77] N. L. Johnson and S. Kotz. Urn Models and their Application. Wiley, New York, 1977.

[JK81] G. D. James and A. Kerber. The Representation Theory of the Symmetric Groups,
volume 16 of Encyclopedia of Mathematics and its Applications. Addison-Wesley, Reading,
Mass., 1981.

[JM85] S. Jimbo and A. Maruoka. Expanders obtained from affine transformations. In
Proceedings of the 17th ACM Symposium on the Theory of Computing (STOC), pages 88-97,
1985.

[JR75] D. M. Jackson and G. H. J. Van Rees. The enumeration of generalized double stochastic
non-negative integer square matrices. SIAM Journal on Computing, 4:475-477, 1975.
Gives the polynomials $H_5(r)$ and $H_6(r)$ for counting 5 x 5 and 6 x 6 ‘magic squares’ with
sums r.

[JS88] M. Jerrum and A. Sinclair. Approximating the permanent. Technical Report CSR-275-86,
University of Edinburgh, Dept. of Computer Science, 1988.

[JS90] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising
model. Technical Report CSR-1-90, University of Edinburgh, Dept. of Computer Science,
1990.

[JVV86] M. Jerrum, L. Valiant, and V. Vazirani. Random generation of combinatorial structures
from a uniform distribution. Theoretical Computer Science, 43:169-188, 1986.

[Kar68] S. Karlin. A First Course in Stochastic Processes. Academic Press, New York, 1968.
Second printing.

[Knu68] D. E. Knuth. The Art of Computer Programming. Addison-Wesley, 1968. Three volumes.
Volume 1 first appeared 1968. The material on Young tableaux appears in Volume 3.




[Knu70] D. E. Knuth. Permutations, matrices, and generalized Young tableaux. Pacific Journal
of Mathematics, 34:709-727, 1970.

[KR88] H. Karloff and P. Raghavan. Randomized algorithms and pseudo-random numbers. In
Proceedings of the 20th ACM Symposium on the Theory of Computing (STOC), 1988.

[LPS86] A. Lubotzky, R. Phillips, and P. Sarnak. Explicit expanders and the Ramanujan
conjectures. In Proceedings of the 18th ACM Symposium on the Theory of Computing (STOC),
pages 240-245, 1986.

[LS90] L. Lovász and M. Simonovits. The mixing rate of Markov chains, an isoperimetric
inequality, and computing the volume. Technical Report Preprint 27, The Mathematical
Institute of the Hungarian Academy of Sciences, 1990.

[Mac16] P. A. MacMahon. Combinatory Analysis. Cambridge University Press, 1916. Reprinted
in 1960 by Chelsea, New York as one volume.

[Mac79] I. G. MacDonald. Symmetric Functions and Hall Polynomials. Clarendon Press, Oxford,
1979.

[Mar72] D. L. March. Exact probabilities for r x c contingency tables. Communications of the
ACM, 15:991-992, 1972.

[Mat85] P. Matthews. Covering Problems for Random Walks on Spheres and Finite Groups.
PhD thesis, Department of Statistics, Stanford University, 1985. Available as Technical
Report No. 234.

[MGB74] A. Mood, F. Graybill, and D. Boes. Introduction to the Theory of Statistics. McGraw
Hill, 1974. Third Edition.

[MM64] M. Marcus and H. Minc. A Survey of Matrix Theory and Matrix Inequalities. Allyn and
Bacon, Boston, 1964.

[MO63] L. E. Moses and R. V. Oakford. Tables of Random Permutations. Stanford University
Press, 1963.

[MPS88] C. R. Mehta, N. R. Patel, and P. Senchaudhuri. Importance sampling for estimating exact
probabilities in permutational inference. Journal of the American Statistical Association,
83(404):999-1005, 1988.

[MRR+53] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
Equations of state calculation by fast computing machines. Journal of Chemical Physics,
21:1087-1091, 1953.




[MV88] M. Mihail and V. Vazirani. On the magnification of 0-1 polytopes. Unpublished
manuscript, 1988.

[NS60] D. J. Newman and L. Shepp. The double Dixie cup problem. American Mathematical
Monthly, 67(1):58-61, 1960. Analyzes the expected number of coupons required in
multiple coupon collecting, when a constant number of complete sets is desired.

[PTH81] M. Pagano and K. Taylor-Halvorsen. An algorithm for finding the exact significance levels
of r x c contingency tables. Journal of the American Statistical Association, 76(376):931-4,
1981.

[PW60] L. E. Payne and H. F. Weinberger. An optimal Poincaré inequality for convex domains.
Arch. Rational Mech. Anal., 5:286-292, 1960.

[Ser77] J.-P. Serre. Linear Representations of Finite Groups. Springer-Verlag, 1977. Translation
of the French edition of 1971.

[SJ87] A. Sinclair and M. Jerrum. Approximate counting, uniform generation, and rapidly
mixing Markov chains. Technical Report CSR-241-87, University of Edinburgh, Dept.
of Computer Science, 1987.

[Sta73] R. P. Stanley. Linear homogeneous Diophantine equations and magic labelings of graphs.
Duke Math Journal, 40:607-632, 1973.

[Sta86] R. P. Stanley. Enumerative Combinatorics. Wadsworth and Brooks/Cole, Monterey,
California, 1986. Only the first volume has appeared.

[Tho86] H. Thorisson. On maximal and distributional coupling. Annals of Probability,
14:874-876, 1986.

[Val79] L. G. Valiant. The complexity of computing the permanent. Theoretical Computer
Science, 8:189-201, 1979.


